Abstract —Background subtraction has been a fundamental
and widely studied task in video analysis, with a wide range
of applications in video surveillance, teleconferencing and 3D
modeling. Recently, motivated by compressive imaging,
background subtraction from compressive measurements (BSCM)
has become an active research task in video surveillance. In
this paper, we propose a novel tensor-based robust PCA
(TenRPCA) approach for BSCM by decomposing video frames into
backgrounds with spatio-temporal correlations and foregrounds
with spatio-temporal continuity in a tensor framework. In this
approach, we use 3D total variation (TV) to enhance the
spatio-temporal continuity of foregrounds, and Tucker decomposition
to model the spatio-temporal correlations of video background.
Based on this idea, we design a basic tensor RPCA model over
the video frames, dubbed the holistic TenRPCA model
(H-TenRPCA). To characterize the correlations among the groups
of similar 3D patches of video background, we further design a
patch-group-based tensor RPCA model (PG-TenRPCA) by joint
tensor Tucker decompositions of 3D patch groups for modeling
the video background. Efficient algorithms using the alternating
direction method of multipliers (ADMM) are developed to solve
the proposed models. Extensive experiments on simulated and
real-world videos demonstrate the superiority of the proposed
approaches over the existing state-of-the-art approaches.

Index Terms —Background subtraction, compressive imaging, 
video surveillance, robust principal component analysis, tensor 
decomposition, 3D total variation, nonlocal self-similarity. 
I. Introduction

Since the 1990s, background subtraction has been attracting
great attention in the fields of image processing and
computer vision. It aims at simultaneously separating the video
background and extracting the moving objects from a video
stream, which provides important cues for numerous applications
such as moving object detection and object tracking in
surveillance.

Most current video background subtraction techniques
consist of four steps: video acquisition, encoding,
decoding, and separating the moving objects from the
background. For example, Lamarre and Clark performed
background subtraction on JPEG-encoded video frames using a
probabilistic model; Aggarwal et al. considered detecting
moving objects in an MPEG-compressed video using the DCT
coefficients of video frames. These conventional approaches
commonly implement video acquisition, coding, and background
subtraction in separate procedures. This conventional
scheme requires fully sampling the video frames, with large
storage requirements, followed by well-designed video coding
and background subtraction algorithms. Recently, motivated
by compressive sensing (CS) in signal processing,
we focus on a newly developed compressive imaging
scheme for background subtraction that combines
video acquisition, coding and background subtraction into
a single framework, which is called background subtraction
from compressive measurements (BSCM). Figure 1
shows an illustrative example. The video imaging system
first captures compressive measurements from the scene, and
then transmits these measurements to the processing center
for foreground/background reconstruction. Compared to the
conventional scheme, this new scheme need not fully sense all
the video voxels, and thus substantially reduces the computational
and storage costs and even the energy consumption of the imaging
sensors.

The task of the BSCM is to reconstruct the original video
with high fidelity while accurately separating the
moving objects from the video background based on compressive
measurements. The objective of this task is to maximize
the reconstruction and separation accuracies using as few
compressive measurements as possible. This is a severely
ill-posed inverse problem, and it is necessary to exploit
prior knowledge about the video to make it well-posed. Several
works already address the BSCM task.
Fig. 2. Illustration of video priors. Please see the text for details. (a) Non-local self-similarity prior of video background. A 3D patch has many similar 3D
patches in video background. (b) The spatio-temporal continuity of video foreground. (c) Spatio-temporal correlation of video background. The red and blue
curves show the singular values of two matrices, i.e., one matrix with columns of vectorized video background frames and another matrix as one frame from
the video background. These two curves indicate the strong temporal correlation and moderate spatial correlation. (d) Spatial correlations in more natural
images.



Fig. 1. The framework of the compressive sensing surveillance system. 

The first seminal work was proposed by Cevher et al.,
in which the dynamic adaptation of the background constraint
and foreground reconstruction are gracefully handled. Then,
Waters et al. [19] observed that the frames in video background
possess strong temporal correlation and the moving objects
often occupy a small region in video foreground, and proposed
a robust principal component analysis (RPCA) model to cope
with this task. Guo et al. [20] further proposed an online
algorithm that utilizes the spatial continuity of the supports
of moving objects in video foreground. Jiang et al. [21], [22]
proposed a reconstruction model in which the sparsity of
video foreground in the transform domain is considered. We
note that, first, all of these approaches model and characterize
different video priors in a matrix framework. Second, although
these algorithms have achieved good performance, finer
video priors of background and foreground have not been fully
exploited. Thus, there is room for more effective algorithms.

In this work, we propose a tensor robust principal component
analysis (TenRPCA) approach for the task of the
BSCM. In this framework, we take the video frames or video
patches as high-order tensors, and extend the robust PCA
approach for matrices to the tensor-based video representation
by fully investigating the domain-specific prior knowledge
of surveillance videos for regularizing this inverse problem.
Compared to the matrix representation of surveillance video
that represents each frame as a vector, this tensor-based video
representation directly takes a video frame as a matrix slice
in a tensor, which preserves the spatial and temporal structure
of the surveillance video.

As shown in Fig. 2, we observe three types of priors
for most surveillance videos with static backgrounds, i.e., the
nonlocal similarity of 3D patches in video background, the
spatio-temporal continuity of video foreground, and the
spatio-temporal correlation in video background. First, as shown in
Fig. 2(a), a 3D patch in video background possesses many
similar 3D patches over the video background, and each group
of similar 3D patches has strong correlation. This property
is termed the nonlocal self-similarity of video background.
Second, as shown in Fig. 2(b), the moving car in video
foreground is spatially continuous in both its support regions
and its intensity values in these regions. Moreover, the moving
car is also temporally continuous among succeeding frames.
We term this prior the spatio-temporal continuity of video
foreground. Third, the video backgrounds are spatially and
temporally correlated. In Fig. 2(c), we show two curves (red
and blue) of the singular values^1 of two matrices, i.e., one
matrix with columns of vectorized video background frames
and another matrix as one frame from the video background.
The rapidly decaying trend of the red curve indicates
the strong temporal correlation among the video background
frames, and the slowly decaying trend of the blue curve indicates
the weak correlation in the spatial domain. Let us define the
accumulation energy ratio of the top $k$ normalized singular values
as $\mathrm{AccEgyR} = \sum_{i=1}^{k} \mathrm{nsv}_i$, where $\mathrm{nsv}_i$ is the $i$-th normalized
singular value, i.e., $\mathrm{nsv}_i = \mathrm{sv}_i / \sum_{j} \mathrm{sv}_j$, and $\mathrm{sv}_i$ is the $i$-th
singular value. The arrow box for the red curve indicates that
the top 3 singular values alone attain the ratio 0.9030, while
the arrow box for the blue curve indicates that the top 84 singular
values are needed to attain the ratio 0.9770. These quantitative values
confirm that the video background has strong correlation among
its frames and each video background frame has weak spatial
correlation. Fig. 2(d) further exhibits the weak correlations
in other natural images. We term this observation the
spatio-temporal correlation of video background.

^1 For better illustration in Fig. 2(c), we normalize the singular values of a
matrix so that they sum to one.
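
The accumulation energy ratio defined above can be computed directly from a singular value decomposition. The following is a minimal NumPy sketch, not the procedure used to produce the figure: the function name `acc_energy_ratio` and the synthetic near-static background are assumptions made purely for illustration.

```python
import numpy as np

def acc_energy_ratio(matrix, k):
    """Accumulation energy ratio of the top-k normalized singular values."""
    sv = np.linalg.svd(matrix, compute_uv=False)  # returned in descending order
    nsv = sv / sv.sum()                           # normalize so that they sum to one
    return float(nsv[:k].sum())

# Hypothetical background: one frame repeated D times with tiny perturbations.
rng = np.random.default_rng(0)
H, W, D = 16, 12, 10
frame = rng.random((H, W))
frames = np.stack(
    [frame + 1e-3 * rng.standard_normal((H, W)) for _ in range(D)], axis=2
)

temporal = frames.reshape(H * W, D)  # columns are vectorized frames
spatial = frames[:, :, 0]            # a single frame viewed as a matrix

ratio_t = acc_energy_ratio(temporal, 1)  # temporal correlation
ratio_s = acc_energy_ratio(spatial, 1)   # spatial correlation
```

On such data the first ratio is close to one because the frames are nearly identical, whereas a single unstructured frame spreads its energy over many singular values.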

Based on the aforementioned video priors, we model the
BSCM in a tensor RPCA framework (TenRPCA) using the Tucker
decomposition technique. With our model, a video volume
represented by a tensor is decomposed into a background
layer based on its spatio-temporal correlation and a foreground
layer based on its spatio-temporal continuity. We design a
Tucker decomposition approach to model the spatio-temporal
correlation in video background and a 3D total variation
(TV) term to enforce the spatio-temporal continuity of video
foreground. Following this idea, we propose two TenRPCA models
by representing video background as a single tensor and as a few
patch-level tensors over groups of similar 3D patches, which
are dubbed the holistic tensor RPCA model (H-TenRPCA)
and the patch-group-based tensor RPCA model (PG-TenRPCA),
respectively. We design efficient algorithms using the alternating
direction method of multipliers (ADMM) to optimize
these proposed models. The experiments on synthetic and real
videos demonstrate that our proposed two TenRPCA models
achieve higher reconstruction and background/foreground separation
accuracies with fewer compressive measurements than
the existing state-of-the-art approaches. Moreover, the
PG-TenRPCA model generally works better than the H-TenRPCA
model, which indicates the effectiveness of our modeling of
the nonlocal self-similarity of video background.

Our contributions are fourfold. First,
to the best of our knowledge, we are the first to model the
BSCM task in a tensor robust PCA framework. Compared to
the matrix-based video representation, this tensor-based video
representation preserves the spatio-temporal structures
of video well, which enables us to fully characterize the priors of
video spatio-temporal structures in our framework. Second,
we fully investigate the video priors for the BSCM task. We
design a 3D total variation (TV) term to encode the
spatio-temporal continuity of video foreground, and a Tucker
decomposition approach to model the spatio-temporal correlation of
video background. Third, based on the observation of nonlocal
self-similarity of video background, we design a patch-level
background model using joint Tucker decomposition over
groups of similar 3D patches to model the strong correlations
among similar 3D patches. This model significantly outperforms
our holistic TenRPCA model, which represents the video
background as a single tensor. Finally, based on ADMM with
the adaptive scheme, we design efficient algorithms to solve
the proposed models, and achieve superior performance over
the existing methods on various video data sets, especially
when the sampling ratio is very low.

The remainder of this paper is organized as follows. In
Section II, the related works will be discussed. In Section III, 
the general framework of the BSCM will be reviewed. Our 
models and their motivations will be presented in Section IV. 
In Section V, efficient algorithms will be designed to solve 
the proposed models. In Section VI, extensive experiments 
on various surveillance video data sets will be conducted to 
substantiate the superiority of the proposed models over the 
other existing ones. This paper will be concluded with some 
discussions on future work in Section VII. 

II. Related work 

A. Background Subtraction without Compressive Imaging 

Various approaches for background subtraction using conventional
imaging cameras have been developed since the 1990s
and have found a wide range of applications in many fields.
These approaches can be mainly categorized into the following 
five classes: the basic approach, the statistical approach, the 
fuzzy approach, the neural and neuro-fuzzy approach, and the 
subspace learning approach |1||-|[^. 

Among these traditional approaches, the subspace learning
approach has been attracting wide attention in the fields of
machine learning and computer vision. One classical work on
this task was proposed by Oliver et al. [23], which uses an
eigenspace (PCA) idea to model the background. To
remedy the outlier and heavy noise issue, Candes et al. [24]
proposed robust principal component analysis (RPCA) to resist
gross sparse noise. This seminal work has triggered
tremendous interest in dealing with background subtraction
using different formulations of RPCA. For example, the
Markov random field (MRF) regularized RPCA technique was
proposed in Zhou et al. [25], a novel block sparse RPCA
formulation was proposed in [26], total variation regularized
RPCA and matrix factorization methods were respectively
proposed in Cao et al. [27] and Guo et al. [28], and
probabilistic versions of RPCA were proposed in Ding et al. [29]
and Babacan et al. [30], respectively. In recent work, Zhao
et al. [31] proposed a new probabilistic variant by extracting
multi-layer structures with certain physical meanings using the
mixture of Gaussians (MOG).

To meet the real-time requirements in practical applications,
various online subspace learning approaches were developed.
Rymel et al. and Li et al. each proposed
an incremental PCA method to handle newly arriving
video streams. By constraining the subspace on the Grassmannian
manifold, Balzano et al. [34]-[37] proposed two efficient
approaches, named GROUSE and GRASTA, to
deal with the online subspace identification and tracking (SIT)
task. Additionally, it was reported that GROUSE
and GRASTA can effectively achieve real-time background
subtraction by subsampling the voxels of the video sequence. Xu
et al. [38] further proposed an updated version of GRASTA
by modeling the contiguous structure of the supports of video
foreground using group sparsity. Chi et al. [39] developed
an online parallel SIT algorithm using the recursive least squares
technique for real-time background subtraction.

B. Background Subtraction with Compressive Imaging 

Recently, multiple studies have addressed the
background subtraction problem from the perspective of compressive
imaging, in which it is required to simultaneously
perform background subtraction and video reconstruction. The
first seminal work was proposed by Cevher et al., in
which the dynamic adaptation of the background constraint and
foreground reconstruction are gracefully handled. Recently,
based on the theoretical results of $\ell_1$-$\ell_1$ minimization [40],
Mota et al. [41] proposed an efficient adaptive-rate algorithm
to deal with the BSCM task. Additionally, a series of works
has been proposed based on the matrix RPCA technique.
Waters et al. [19] integrated the matrix RPCA methodology
into the framework of the BSCM and then developed a greedy
algorithm called SpaRCS to solve the resulting model. Guo et
al. [20] developed an online RPCA algorithm that models the
spatial continuity prior of moving objects in the foreground.
Jiang et al. [22] proposed a new RPCA model in which
the sparsity of video foreground in the transform domain is
considered based on certain practical requirements.

The matrix RPCA approaches for the BSCM commonly
model the video as a matrix with columns of vectorized
video frames. Although the matrix RPCA methodology has
been an increasingly useful technique, it cannot fully exploit
the prior knowledge on the intrinsic structures of
video once the video frames have been vectorized. Our proposed tensor
RPCA approach considers more extensive spatio-temporal
prior knowledge of video background and foreground using
a tensor representation of video. Such full utilization of prior
information makes our approach capable of achieving a better
video reconstruction quality and simultaneously detecting the
moving objects in foreground from a limited number of
compressive measurements, as will be shown in Section VI.
We also note that tensor compressive sensing models
were recently proposed in [42], [43]. However, they are significantly
different from our models, because these models are designed
for the image/video compressive sensing task instead of the
more complex BSCM task considered in this paper.

III. The General Framework of the BSCM 

In this section, we will present the general framework
for the BSCM task. We will mainly focus on the mathematical
modeling and algorithm design in the pipeline of
BSCM, i.e., reconstructing video foreground and background
from compressive measurements. In the following, we will
introduce the basic components of the BSCM task, including
the representation of video volume, compressive operator, and
video reconstruction and separation.

A. Video Volume 

Video frames within a short period are collected as a
video volume. If the video frame has a single channel, then
the video volume can be represented as a 3-order tensor
$\mathcal{X}_0 := \{\mathbf{X}_0^1, \mathbf{X}_0^2, \ldots, \mathbf{X}_0^D\}$, where each matrix
$\mathbf{X}_0^d \in \mathbb{R}^{H \times W}$ ($d = 1, 2, \ldots, D$) represents the $d$-th frame. $H$ and $W$ denote the height
and width of a frame and $D$ denotes the number of frames.
This tensor has 3 modes including height, width and time. We
assume that the video volume to be reconstructed can be separated
into a static component (video background) $\mathcal{X}_1$ and a dynamic
component (video foreground) $\mathcal{X}_2$, i.e., $\mathcal{X}_0 := \mathcal{X}_1 + \mathcal{X}_2$,
where $\mathcal{X}_1 := \{\mathbf{X}_1^1, \mathbf{X}_1^2, \ldots, \mathbf{X}_1^D\}$ and $\mathcal{X}_2 := \{\mathbf{X}_2^1, \mathbf{X}_2^2, \ldots, \mathbf{X}_2^D\}$.
In the following, we denote the vectorization of a video
volume $\mathcal{X}_0$ by $\mathbf{x}_0 := [\mathbf{x}_0^1; \mathbf{x}_0^2; \ldots; \mathbf{x}_0^D]$, and the vectorizations of
video background and foreground by $\mathbf{x}_1 := [\mathbf{x}_1^1; \mathbf{x}_1^2; \ldots; \mathbf{x}_1^D]$
and $\mathbf{x}_2 := [\mathbf{x}_2^1; \mathbf{x}_2^2; \ldots; \mathbf{x}_2^D]$, respectively.

B. Compressive Operator 

The compressive operator can be considered as an effective
encoding of the video volume. Currently, how to design a
high-quality compressive operator is a crucial research topic in
the CS community; see [47]. For video data, the compressive
measurements $\mathbf{y}$ can be obtained by

$$\mathbf{y} = \mathcal{A}(\mathbf{x}_0), \qquad (1)$$

where $\mathbf{y}$ is a vector of length $M$, and $\mathcal{A}$ indicates a given
compressive operator.

In this work, the randomly permuted Walsh-Hadamard
operator [21] and the randomly permuted noiselet operator [44]
will be employed as compressive operators because of their
low computational cost and easy hardware implementation.
The compressive operator can be instantiated as $\mathcal{A} = \mathbf{D} \cdot \mathbf{H} \cdot \mathbf{P}$,
where $\mathbf{P}$ is a random permutation matrix, $\mathbf{H}$ is the
Walsh-Hadamard transform or the noiselet transform, and $\mathbf{D}$ is a
random downsampling operator. As stated in [15], the compressive
operator often encodes the video volume $\mathbf{x}_0$ in one of two ways.
One is the holistic manner, i.e., $\mathbf{y} = \mathbf{D} \cdot \mathbf{H} \cdot \mathbf{P}(\mathbf{x}_0)$, which directly
collects full 3D measurements of a video sequence. The other
is the frame-wise manner, i.e., $\mathbf{y}^d = \mathbf{D}^d \cdot \mathbf{H}^d \cdot \mathbf{P}^d(\mathbf{x}_0^d)$ ($d = 1, 2, \ldots, D$),
which collects 2D frame-by-frame measurements
$\mathbf{y}^d$ and then concatenates all $\mathbf{y}^d$ into a long vector $\mathbf{y}$. In most
experiments of this work, the compressive operator $\mathcal{A}$ will be set
as the frame-by-frame one.
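
As an illustration of the factorization of the compressive operator into a permutation, a Walsh-Hadamard transform and a random downsampling, the following NumPy sketch builds a frame-wise operator together with its adjoint. It is only a sketch under stated assumptions: the helper names `fwht` and `make_operator`, the natural (Sylvester) ordering of the Hadamard transform, the orthonormal scaling by $1/\sqrt{n}$, and the frame size are all choices made here for illustration, and the frame length must be a power of two.

```python
import numpy as np

def fwht(x):
    """Unnormalized fast Walsh-Hadamard transform (natural ordering).
    The length of x must be a power of two."""
    x = np.asarray(x, dtype=float).copy()
    n = x.size
    h = 1
    while h < n:
        x = x.reshape(-1, 2, h)
        a, b = x[:, 0, :].copy(), x[:, 1, :].copy()
        x[:, 0, :], x[:, 1, :] = a + b, a - b   # butterfly step
        x = x.reshape(n)
        h *= 2
    return x

def make_operator(n, m, seed=0):
    """Return (forward, adjoint) for A = D * H * P acting on length-n vectors."""
    rng = np.random.default_rng(seed)
    perm = rng.permutation(n)                    # P: random permutation
    rows = rng.choice(n, size=m, replace=False)  # D: keep m random coefficients

    def forward(x):
        return fwht(x[perm])[rows] / np.sqrt(n)

    def adjoint(y):
        z = np.zeros(n)
        z[rows] = y                              # D^T: zero-fill
        out = np.zeros(n)
        out[perm] = fwht(z) / np.sqrt(n)         # P^T applied after H^T = H
        return out

    return forward, adjoint

# Frame-wise sampling: one operator per frame, measurements concatenated.
rng = np.random.default_rng(1)
H, W, D, m = 8, 8, 3, 16
frames = rng.random((H, W, D))
ops = [make_operator(H * W, m, seed=d) for d in range(D)]
y = np.concatenate(
    [fwd(frames[:, :, d].ravel()) for d, (fwd, _) in enumerate(ops)]
)
```

Having the adjoint available matters in practice because iterative reconstruction schemes repeatedly apply both the operator and its transpose; the fast transform keeps each application at $O(n \log n)$ cost without forming any matrix.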

C. Reconstruction and Separation of Video Volume 

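As we know, recovering $\mathbf{x}_0$ and simultaneously separating
$\mathbf{x}_1$ from $\mathbf{x}_2$ given the compressive measurements $\mathbf{y}$ is a severely
ill-posed inverse problem. Hence, it is necessary to regularize
this inverse problem by exploiting the underlying video prior
knowledge. Mathematically, the regularized inverse problem
can be generally formulated as

$$\min_{\mathbf{x}_0, \mathbf{x}_1, \mathbf{x}_2} \; \lambda \Omega_2(\mathbf{x}_2) + \Omega_1(\mathbf{x}_1) \quad \text{s.t.} \quad \mathbf{x}_0 = \mathbf{x}_2 + \mathbf{x}_1, \;\; \mathbf{y} = \mathcal{A}(\mathbf{x}_0), \qquad (2)$$

where $\Omega_1(\mathbf{x}_1)$ and $\Omega_2(\mathbf{x}_2)$ are the prior knowledge modeling
terms on video background and foreground, respectively;
$\mathbf{x}_0$, $\mathbf{x}_1$ and $\mathbf{x}_2$ are the vectorizations of $\mathcal{X}_0$, $\mathcal{X}_1$ and $\mathcal{X}_2$,
respectively; and $\lambda$ is a trade-off parameter between the terms
$\Omega_1(\mathbf{x}_1)$ and $\Omega_2(\mathbf{x}_2)$.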
In the following section, we will fully explore the priors
for surveillance videos and characterize these priors using
tensor algebra, which naturally instantiates the general model
in Eq. (2) into practical models.

IV. Tensor RPCA Models for the BSCM

In this section, we will present our proposed tensor robust
principal component analysis (RPCA) models for the BSCM task. We
first review the basics of multi-linear algebra. Then, we
present our basic model for video decomposition, and further
propose a detailed foreground model and background model by
considering the spatio-temporal continuity of video foreground
and the spatio-temporal correlations of video background. In the
background modeling, we propose two models that represent
the video background as a single tensor and as several
patch-level tensors over groups of similar 3D patches, respectively.
We utilize tensor Tucker decomposition to model the video
background in the aforementioned holistic and patch-based
representations, and produce two tensor RPCA models (named
H-TenRPCA and PG-TenRPCA), respectively.


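TABLE I
Notations

Notation                                                  Explanation
$\mathcal{X}$, $\mathbf{X}$, $\mathbf{x}$, $x$            tensor, matrix, vector, scalar.
$\mathbf{x}(:, i_2, i_3, \ldots, i_N)$                    fiber of tensor $\mathcal{X}$ obtained by fixing all but one index.
$\mathbf{X}(:, :, i_3, \ldots, i_N)$                      slice of tensor $\mathcal{X}$ obtained by fixing all but two indices.
$\mathbf{X}_{(n)}$                                        mode-$n$ matricization of tensor $\mathcal{X} \in \mathbb{R}^{I_1 \times I_2 \times \cdots \times I_N}$, obtained by arranging the mode-$n$ fibers as the columns of the resulting matrix of size $I_n \times \prod_{k \neq n} I_k$.
$\mathrm{Vec}(\mathcal{X})$                               vectorization of tensor $\mathcal{X}$.
$\mathrm{Ten}(\mathbf{x})$                                tensorization of vector $\mathbf{x}$, i.e., the inverse operation of Vec.
$(r_1, r_2, \ldots, r_N)$                                 multi-linear rank, where $r_n = \mathrm{Rank}(\mathbf{X}_{(n)})$, $n = 1, 2, \ldots, N$.
$\langle \mathcal{X}, \mathcal{Y} \rangle$                inner product of tensors $\mathcal{X}$ and $\mathcal{Y}$.
$\|\mathcal{X}\|_F$                                       Frobenius norm of tensor $\mathcal{X}$.
$\mathcal{Y} = \mathcal{X} \times_n \mathbf{U}$           mode-$n$ multiplication of $\mathcal{X}$ and $\mathbf{U}$, with the matrix representation $\mathbf{Y}_{(n)} = \mathbf{U}\mathbf{X}_{(n)}$.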


A. Tensor Basics 

A tensor can be seen as a multi-index numerical array. The
order of a tensor is the number of its modes or dimensions. A
real-valued tensor of order $N$ is denoted by $\mathcal{X} \in \mathbb{R}^{I_1 \times I_2 \times \cdots \times I_N}$
and its entries by $x_{i_1, i_2, \ldots, i_N}$. Then an $N \times 1$ vector $\mathbf{x}$ is
considered as a tensor of order one, and an $N \times M$ matrix $\mathbf{X}$
as a tensor of order two. Subtensors are parts of the original
tensor, created when only a fixed subset of indices is used.
Vector-valued subtensors are called fibers, defined by fixing
every index but one, and matrix-valued subtensors are called
slices, obtained by fixing all but two indices. Manipulation
of tensors often requires their reformatting (reshaping); a
particular case of reshaping tensors into matrices is termed
matrix unfolding or matricization. The multi-linear rank
of an $N$-order tensor is the tuple of the ranks of the mode-$n$
unfoldings. The inner product of two same-sized tensors $\mathcal{X}$
and $\mathcal{Y}$ is the sum of the products of their entries. The
mode-$n$ multiplication of a tensor $\mathcal{X}$ with a matrix $\mathbf{U}$ amounts to
the multiplication of all mode-$n$ vector fibers with $\mathbf{U}$, i.e.,
$(\mathcal{X} \times_n \mathbf{U})_{i_1, \ldots, i_{n-1}, j, i_{n+1}, \ldots, i_N} = \sum_{i_n} x_{i_1, \ldots, i_n, \ldots, i_N} \, u_{j, i_n}$. The
tensor notations used are summarized in Table I. For more
details about multi-linear algebra, please see [45], [46].
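
The operations reviewed above (matricization, mode-$n$ multiplication and multi-linear rank) can be sketched in a few lines of NumPy. The function names below are hypothetical, and the column ordering of the unfolding follows NumPy's row-major reshaping, which is an assumption of this sketch; other orderings used in the literature differ only by a column permutation.

```python
import numpy as np

def unfold(tensor, mode):
    """Mode-n matricization: the mode-n fibers become the columns."""
    return np.moveaxis(tensor, mode, 0).reshape(tensor.shape[mode], -1)

def fold(matrix, mode, shape):
    """Inverse of unfold for a tensor of the given target shape."""
    full = [shape[mode]] + [s for i, s in enumerate(shape) if i != mode]
    return np.moveaxis(matrix.reshape(full), 0, mode)

def mode_n_product(tensor, matrix, mode):
    """Multiply every mode-n fiber by `matrix`, i.e. Y_(n) = U X_(n)."""
    new_shape = list(tensor.shape)
    new_shape[mode] = matrix.shape[0]
    return fold(matrix @ unfold(tensor, mode), mode, new_shape)

def multilinear_rank(tensor):
    """Tuple of the ranks of all mode-n unfoldings."""
    return tuple(
        int(np.linalg.matrix_rank(unfold(tensor, n))) for n in range(tensor.ndim)
    )

X = np.arange(60, dtype=float).reshape(3, 4, 5)  # a small 3-order tensor
U = np.arange(8, dtype=float).reshape(2, 4)      # acts on mode 1 (size 4)
Y = mode_n_product(X, U, 1)                      # resulting shape: (3, 2, 5)
```

Note that Python indexes modes from zero, so mode 1 in this sketch corresponds to the second mode in the notation of Table I.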

B. General Decomposition Model of Video Volume 

For real surveillance videos, we observe that there
might exist some disturbances (e.g., randomly dynamic components)
in the video background, for example, the fountain
in the “Fountain” video and the ripple in the “WaterSurface”
video as shown in Fig. Therefore, the assumption made in
most existing work [19], [21], [24] that the video background is
strictly low rank may not be accurate.

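Motivated by the above observation, we further decompose
the video background (i.e., $\mathcal{X}_1$ in Eq. (2)) as the sum of the
low rank component $\mathcal{L}$ (the ideal video background) and the
disturbance $\mathcal{E}$ in this work. Then, the video volume $\mathcal{X}_0$ can be
decomposed as $\mathcal{X}_0 = \mathcal{X}_2 + \mathcal{E} + \mathcal{L}$; see Fig. 3. Accordingly, the
general model in Eq. (2) is replaced by the following more
accurate version:

$$\min_{\mathbf{x}_0, \mathbf{e}, \mathcal{L}, \mathbf{x}_2} \; \lambda \Omega_2(\mathbf{x}_2) + \zeta T(\mathbf{e}) + \Omega_1(\mathcal{L}) \quad \text{s.t.} \quad \mathbf{x}_0 = \mathbf{x}_2 + \mathbf{e} + \mathrm{Vec}(\mathcal{L}), \;\; \mathbf{y} = \mathcal{A}(\mathbf{x}_0), \qquad (3)$$

where $\mathbf{e} = \mathrm{Vec}(\mathcal{E})$, $\lambda$ and $\zeta$ are both trade-off parameters, and
$T(\mathbf{e})$ is often specified as $\frac{1}{2}\|\mathbf{e}\|_2^2$ or $\|\mathbf{e}\|_1$. In this work, we use
$T(\mathbf{e}) = \frac{1}{2}\|\mathbf{e}\|_2^2$. In the following, we will focus on discovering
the priors of video foreground/background and then encoding
these priors, i.e., specifying $\Omega_2(\mathbf{x}_2)$ and $\Omega_1(\mathcal{L})$.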



Fig. 3. Illustration for the decomposition of video volume. 


C. Foreground Modeling 

The video foreground consists of the salient moving
objects in a video, which often occupy a certain proportion
of contiguous regions of the video frames. Examples are the
car in the “simulated” video shown in Fig. 2(b), and the
pedestrians in the “Hall” video as shown in Fig. These
moving objects to be detected commonly occupy a certain
contiguous region in the spatial domain. Fig. 3 gives an
intuitive illustration, indicating that we hope to detect $\mathcal{X}_2$ with
contiguous supports in the spatial domain instead of the disturbance
$\mathcal{E}$ with disconnected supports. Additionally, the moving trace
of a foreground object is temporally smooth, which can be
observed from the example of the car in the “simulated” video
shown in Fig. 2(b) and the pedestrians in the “Hall” video
shown in Fig. We term these two discovered structures of
video foreground the spatio-temporal continuity prior.



Fig. 4. Illustration for the holistic reconstruction and separation model. (a)
3D-TV on the voxel; (b) The ideal video background can be reconstructed by
Tucker decomposition.

We define a 3D total variation (TV) to model the
spatio-temporal continuity. As shown in Fig. 4(a), for the reference
voxel $(i, j, k)$ in video foreground $\mathcal{X}_2$, we devise the following
quantity to describe its spatio-temporal continuity:

$$\mathrm{TV}_{i,j,k}(\mathbf{x}_2) := |\mathcal{X}_2(i,j,k) - \mathcal{X}_2(i+1,j,k)| + |\mathcal{X}_2(i,j,k) - \mathcal{X}_2(i,j+1,k)| + |\mathcal{X}_2(i,j,k) - \mathcal{X}_2(i,j,k+1)|.$$

Summing the quantity with respect to all the voxels leads to
the proposed 3D-TV:

$$\|\mathbf{x}_2\|_{\mathrm{3D\text{-}TV}} := \sum_{i,j,k} \mathrm{TV}_{i,j,k}(\mathbf{x}_2).$$

It is worth noting that in this work we assume that video
boundaries are treated as circular, hence the 3D-TV of the
voxels on video boundaries is well defined.

For better illustration, we further introduce difference operators
to rewrite $\|\mathbf{x}_2\|_{\mathrm{3D\text{-}TV}}$. Let $\mathcal{X}(i,j,k)$ denote the intensity
at the voxel $(i, j, k)$, and let

$$\mathcal{X}_h(i,j,k) := \mathcal{X}(i,j+1,k) - \mathcal{X}(i,j,k),$$
$$\mathcal{X}_v(i,j,k) := \mathcal{X}(i+1,j,k) - \mathcal{X}(i,j,k),$$
$$\mathcal{X}_t(i,j,k) := \mathcal{X}(i,j,k+1) - \mathcal{X}(i,j,k),$$

denote three difference operations at the voxel $(i,j,k)$ along
the horizontal, vertical, and temporal directions, respectively.
We can now introduce three difference operators with
respect to the three directions as follows:

$$\mathbf{D}_h\mathbf{x} := \mathrm{Vec}(\mathcal{X}_h), \quad \mathbf{D}_v\mathbf{x} := \mathrm{Vec}(\mathcal{X}_v), \quad \mathbf{D}_t\mathbf{x} := \mathrm{Vec}(\mathcal{X}_t),$$

where $\mathbf{x} = \mathrm{Vec}(\mathcal{X})$. Let $\mathbf{D}\mathbf{x} := [(\mathbf{D}_h\mathbf{x})^T, (\mathbf{D}_v\mathbf{x})^T, (\mathbf{D}_t\mathbf{x})^T]^T$
denote the concatenation of the three difference operations. It is
easy to see that the 3D-TV amounts to the $\ell_1$ norm of the difference
vectors:

$$\|\mathbf{x}_2\|_{\mathrm{3D\text{-}TV}} = \|\mathbf{D}\mathbf{x}_2\|_1 = \|\mathbf{D}_h\mathbf{x}_2\|_1 + \|\mathbf{D}_v\mathbf{x}_2\|_1 + \|\mathbf{D}_t\mathbf{x}_2\|_1.$$
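
The 3D-TV with circular boundaries can be evaluated with shifted copies of the volume. The following NumPy sketch is illustrative only; the function name `tv3d` and the two toy volumes are assumptions introduced here. It compares a compact 2 x 2 x 2 block with the same number of isolated voxels, which shows why the 3D-TV favors contiguous foreground supports over scattered disturbances.

```python
import numpy as np

def tv3d(volume):
    """Anisotropic 3D total variation with circular boundary handling."""
    dv = np.roll(volume, -1, axis=0) - volume  # vertical:   X(i+1,j,k) - X(i,j,k)
    dh = np.roll(volume, -1, axis=1) - volume  # horizontal: X(i,j+1,k) - X(i,j,k)
    dt = np.roll(volume, -1, axis=2) - volume  # temporal:   X(i,j,k+1) - X(i,j,k)
    return float(np.abs(dh).sum() + np.abs(dv).sum() + np.abs(dt).sum())

shape = (6, 6, 4)

block = np.zeros(shape)
block[1:3, 1:3, 1:3] = 1.0            # eight voxels forming one contiguous object

scattered = np.zeros(shape)
scattered[0:4:2, 0:4:2, 0:4:2] = 1.0  # eight voxels, none adjacent to another

tv_block = tv3d(block)          # 24.0: only the outer faces of the block count
tv_scattered = tv3d(scattered)  # 48.0: every voxel contributes all six faces
```

Both volumes have the same $\ell_1$ norm, so a plain sparsity penalty cannot distinguish them, whereas the 3D-TV of the contiguous block is half that of the scattered voxels.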

D. Background Modeling 

1) Holistic Background Modeling: As discussed in the
introduction, video background within a short period
possesses spatio-temporal correlation. The strong temporal
correlation in video background implies that the matrix unfolding
$\mathbf{X}_{1(3)}$ in the temporal mode can be approximated by a low
rank matrix. Mathematically, $\mathbf{X}_{1(3)} = \mathbf{U}_3\mathbf{C}_3 + \mathbf{E}_{(3)}$, where
$\mathbf{U}_3$ is a low rank matrix of rank $r_3 \ll D$ and $\mathbf{E}_{(3)}$ is the
disturbance. The weak spatial correlation in video background
implies that the matrix unfoldings $\mathbf{X}_{1(1)}$ and $\mathbf{X}_{1(2)}$ in the
height and width modes can be approximated by two high rank
matrices, respectively. Mathematically, $\mathbf{X}_{1(1)} = \mathbf{U}_1\mathbf{C}_1 + \mathbf{E}_{(1)}$
and $\mathbf{X}_{1(2)} = \mathbf{U}_2\mathbf{C}_2 + \mathbf{E}_{(2)}$, where $\mathbf{U}_1$ and $\mathbf{U}_2$ are
high rank matrices of rank $r_1 < H$ and $r_2 < W$, respectively.
Resorting to the well-known Tucker decomposition in
multi-linear algebra, the matrix factorizations above can be
aggregated together as follows:

$$\mathcal{X}_1 = \mathcal{G} \times_1 \mathbf{U}_1 \times_2 \mathbf{U}_2 \times_3 \mathbf{U}_3 + \mathcal{E}, \qquad (5)$$

where the factor matrices $\mathbf{U}_1$ and $\mathbf{U}_2$ are orthogonal in columns for
the two spatial modes, the factor matrix $\mathbf{U}_3$ is orthogonal in columns
for the temporal mode, the core tensor $\mathcal{G}$ captures the interactions among these factors, and
$\mathcal{E}$ is the disturbance. Let $\mathcal{L} = \mathcal{G} \times_1 \mathbf{U}_1 \times_2 \mathbf{U}_2 \times_3 \mathbf{U}_3$. We
call $\mathcal{L}$ the ideal video background. Our holistic background
modeling is intuitively illustrated in Fig. 4(b).

Compared to the matrix modeling technique, the advantage of
the tensor modeling technique is that it characterizes not only
the temporal correlation but also the spatial correlation in
video background. Thus it can reconstruct a more accurate video
background.
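
To make the Tucker model of the background in Eq. (5) concrete, the sketch below fits a low multi-linear rank approximation to a synthetic, fully observed background. It deliberately swaps in a truncated higher-order SVD (HOSVD) as a simple stand-in; it is not the ADMM-based solver of this paper, which works from compressive measurements. The function names, the synthetic data and the chosen ranks $(6, 6, 1)$ are assumptions for illustration.

```python
import numpy as np

def unfold(t, mode):
    """Mode-n matricization of a tensor."""
    return np.moveaxis(t, mode, 0).reshape(t.shape[mode], -1)

def mode_n_product(t, m, mode):
    """Contract the `mode` axis of t with the columns of m."""
    return np.moveaxis(np.tensordot(m, t, axes=(1, mode)), 0, mode)

def truncated_hosvd(t, ranks):
    """Column-orthogonal factors from the leading left singular vectors of each
    unfolding, and the core obtained by projecting onto them."""
    factors = []
    for mode, r in enumerate(ranks):
        u, _, _ = np.linalg.svd(unfold(t, mode), full_matrices=False)
        factors.append(u[:, :r])
    core = t
    for mode, u in enumerate(factors):
        core = mode_n_product(core, u.T, mode)
    return core, factors

def tucker_to_tensor(core, factors):
    """Evaluate G x_1 U_1 x_2 U_2 x_3 U_3."""
    out = core
    for mode, u in enumerate(factors):
        out = mode_n_product(out, u, mode)
    return out

# Hypothetical background: one frame with slowly varying gain, plus a small disturbance.
rng = np.random.default_rng(0)
H, W, D = 8, 6, 5
B = rng.random((H, W))
gains = 1.0 + 0.1 * np.arange(D)
clean = B[:, :, None] * gains[None, None, :]
X1 = clean + 1e-3 * rng.standard_normal((H, W, D))

core, factors = truncated_hosvd(X1, ranks=(6, 6, 1))  # high spatial, low temporal rank
L = tucker_to_tensor(core, factors)                   # estimate of the ideal background
E = X1 - L                                            # residual disturbance
rel_err = np.linalg.norm(L - clean) / np.linalg.norm(clean)
```

The ranks mirror the observation in the text: the temporal mode is compressed to a single component while the spatial modes keep nearly full rank.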

2) Patch-based Background Modeling: Patch-based mod¬ 
eling is a popular and local style modeling technique and 
widely used in the community of image processing. Nonlocal 
self-similarity |[48|-p^ is a patch-based powerful prior and 
means that one patch in one image has many similaij^ structure 
patches. The similarity of patches implies the correlation of 
patches. In this work, we will extend this prior into 3D case 
and approximately reconstruct video background A’l (or say, 
accurately reconstruct the ideal video background C) through 
modeling the video background by groups of similar video 3D 
patches, where each patch group corresponds to a tensor. 

Specifically, we firstly segment video background A'l into 
many overlapped 3D patches of the size w x w x D and 
then collect these 3D patches as a patch set 5: 5 = {Vi G 
^wxwxD . ^ ^ where F indicates the index set and Vi is 
the i-th 3D patch in the set. These 3D patches are commonly 
similar to each other; see Fig.|^a) for an example. We clusteij^ 
the patch set S into K clusters and then collect each cluster as 
a 4-order tensor. Mathematically, let be a matrix extracting 

^Here, two patches are defined as similar if the Euclidean distance between 
two patch vectors is smaller than a given threshold. 

^The technical details concerning how to cluster will be stated in the 
subsequent subsection, Implementation Issues. 
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the o-th 3D patch in the p-th cluster 
{w‘^D) X 1, and define RpXi as: 

/ C^xi 

RpXi;= 

V y 

where $N$ is the number of 3D patches in the $p$-th cluster. Then
$R_p x_1$ can be reshaped into a 4-order tensor $\mathrm{Ten}(R_p x_1)$ of
size $w \times w \times D \times N$, denoted by $\mathcal{R}_p(\mathcal{X}_1)$;
see Fig. 5 for an intuitive illustration. Because the patches in each
cluster have very similar structures, $\mathrm{Ten}(R_p x_1)$ can be expected
to be well approximated by a low-rank tensor $\mathcal{L}_p$, i.e.,
$\mathrm{Ten}(R_p x_1) \approx \mathcal{L}_p$.
The modeling of $\mathcal{L}_p$ will be determined shortly. Then, the
clean and ideal video background can be estimated by solving
the following optimization problem:

$$\min_{x_1} \sum_{p=1}^{K} \| R_p x_1 - \mathrm{Vec}(\mathcal{L}_p) \|_2^2.$$

The solution of this optimization problem can be easily derived as
$x_1 = (\sum_p R_p^T R_p)^{-1} \sum_p R_p^T \mathrm{Vec}(\mathcal{L}_p)$.
Let us denote $\mathrm{Ten}(x_1)$ by $\mathcal{L}$. Hence, $\mathcal{L}$ can be
represented as:

$$\mathcal{L} = \mathrm{Ten}\Big(\big(\sum_p R_p^T R_p\big)^{-1} \sum_p R_p^T \mathrm{Vec}(\mathcal{L}_p)\Big),$$

which means that the ideal video background $\mathcal{L}$ can be
obtained by summing the patches of all clusters at their original
positions, followed by an averaging operation. When the patches in
the patch set $S$ are not overlapped, $(\sum_p R_p^T R_p)^{-1}$
reduces to an identity matrix.
Fig. 5(a) illustrates this procedure in which the averaging
operation is not required.
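The sum-then-average step can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: it skips the clustering, uses every patch exactly once, and the helper names `extract_patches` and `aggregate_patches` are hypothetical. Under these assumptions $\sum_p R_p^T R_p$ is diagonal and holds the number of patches covering each voxel, so applying its inverse is an element-wise division.

```python
import numpy as np

def extract_patches(video, w, step):
    """Collect overlapping w x w x D patches and their top-left corners."""
    H, W, D = video.shape
    corners = [(i, j) for i in range(0, H - w + 1, step)
                      for j in range(0, W - w + 1, step)]
    patches = np.stack([video[i:i + w, j:j + w, :] for i, j in corners])
    return patches, corners

def aggregate_patches(patches, corners, shape, w):
    """Put patches back in place and divide by the per-voxel coverage.

    The accumulation plays the role of sum_p R_p^T Vec(L_p); the coverage
    count is the diagonal of sum_p R_p^T R_p.
    """
    acc = np.zeros(shape)
    count = np.zeros(shape)
    for patch, (i, j) in zip(patches, corners):
        acc[i:i + w, j:j + w, :] += patch
        count[i:i + w, j:j + w, :] += 1.0
    return acc / np.maximum(count, 1.0)

rng = np.random.default_rng(0)
video = rng.random((16, 16, 4))                  # toy H x W x D volume
patches, corners = extract_patches(video, w=8, step=4)
restored = aggregate_patches(patches, corners, video.shape, w=8)
```

If the patches are left unmodified, aggregation returns the original volume exactly; in the model, each patch group is replaced by its low-rank approximation before aggregation.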


Fig. 5. Illustration of the patch-based background modeling. (a) The non-
overlapped patches on the ideal video background can be clustered into three
clusters; (b) The 4-order tensor composed of each cluster can be reconstructed
by low rank Tucker decomposition.

^$\mathcal{R}_p$ indicates the operation which first extracts all 3D patches in the
$p$-th cluster from the video volume, and then arranges these 3D patches as a 4-order
tensor, i.e., $\mathcal{R}_p(\mathcal{X}_1) = \mathrm{Ten}(R_p \mathrm{Vec}(\mathcal{X}_1))$,
while $\mathcal{R}_p^T$ indicates its inverse-order operation, i.e.,
$\mathcal{R}_p^T(\mathcal{L}_p) = \mathrm{Ten}(R_p^T \mathrm{Vec}(\mathcal{L}_p))$; see Fig. 5.

$\mathcal{L}_p$ is a 4-order tensor of size $w \times w \times D \times N$ which
collects all 3D patches in the $p$-th cluster. Because the ideal
video background possesses a strong correlation among the
frames, the mode-3 unfolding $(\mathcal{L}_p)_{(3)}$ is low rank. Moreover,
the observation that the patches in each cluster have very similar
structures implies that $(\mathcal{L}_p)_{(4)}$ is also low rank. Combining
these two points, we can likewise model $\mathcal{L}_p$ by Tucker decomposition:

$$\mathcal{L}_p = \mathcal{G}_p \times_1 U_{1p} \times_2 U_{2p} \times_3 U_3 \times_4 U_{4p},$$

where $\mathcal{G}_p$ is the core tensor, and $U_{1p}$, $U_{2p}$, $U_3$ and $U_{4p}$ are
factor matrices orthogonal in columns. Note that the factor matrices
in the temporal mode for all $p$ are set as a shared matrix $U_3$,
ensuring that $\mathcal{L}$ is low rank in the temporal mode on the whole.
Fig. 5(b) gives an intuitive illustration. The video background
can now be modeled as:

$$\mathcal{X}_1 = \mathcal{L} + \mathcal{E}
= \mathrm{Ten}\Big(\big(\sum_p R_p^T R_p\big)^{-1} \sum_p R_p^T \mathrm{Vec}(\mathcal{L}_p)\Big) + \mathcal{E}, \qquad (8)$$

where the factor matrices $U_{jp}$ $(j = 1, 2, 4)$ and $U_3$ of each $\mathcal{L}_p$
are orthogonal in columns.


E. Reconstruction and Separation Models 

We can now instantiate the general model. Integrating the video
foreground model and the holistic video background model introduced
above into the general model leads to the following holistic
TenRPCA model (H-TenRPCA):

$$\min_{x_0, \mathcal{G}, U_j, e, x_2} \lambda \|D x_2\|_1 + \frac{1}{2}\|e\|_2^2$$
$$\text{s.t. } x_0 = x_2 + e + \mathrm{Vec}(\mathcal{G} \times_1 U_1 \times_2 U_2 \times_3 U_3), \qquad (9)$$
$$y = \mathcal{A}(x_0),$$

where the factor matrices $U_j$ $(j = 1, 2, 3)$ are orthogonal in
columns.

Likewise, integrating the video foreground model and the patch-based
modeling of video background in Eq. (8) leads to the following
patch-group-based tensor RPCA model (PG-TenRPCA):

$$\min_{x_0, \mathcal{G}_p, U_{jp}, U_3, e, x_2} \lambda \|D x_2\|_1 + \frac{1}{2}\|e\|_2^2$$
$$\text{s.t. } x_0 = x_2 + e + \big(\sum_p R_p^T R_p\big)^{-1} \sum_p R_p^T \mathrm{Vec}(\mathcal{G}_p \times_1 U_{1p} \times_2 U_{2p} \times_3 U_3 \times_4 U_{4p}), \qquad (10)$$
$$y = \mathcal{A}(x_0),$$

where the factor matrices $U_{jp}$ $(j = 1, 2, 4)$ and $U_3$ are
orthogonal in columns.

In the following section, we will design efficient algorithms
to solve the proposed models. Note that these models are non-
convex, and therefore, we can only hope to find local solutions.


V. Optimization Algorithms 

In this section, we first develop an efficient algorithm based
on ADMM for solving the proposed model of H-TenRPCA
in Eq. (9). Then, the algorithm is slightly modified to solve
the PG-TenRPCA model in Eq. (10). Finally, we present the
implementation details of our optimization algorithms.












A. Optimization Algorithm for H-TenRPCA 


We optimize the H-TenRPCA model using a multi-block
version of the alternating direction method of multipliers
(ADMM). The H-TenRPCA model in Eq. (9) can
be rewritten as the following equivalent form:

$$\min_{x_0, \mathcal{G}, U_j, e, x_2, f} \lambda \|f\|_1 + \frac{1}{2}\|e\|_2^2$$
$$\text{s.t. } f = D x_2, \qquad (11)$$
$$x_0 = x_2 + e + \mathrm{Vec}(\mathcal{G} \times_1 U_1 \times_2 U_2 \times_3 U_3),$$
$$y = \mathcal{A}(x_0),$$

where the factor matrices $U_j$ $(j = 1, 2, 3)$ are orthogonal in
columns. This constrained optimization problem can be solved
by its Lagrangian dual form. The augmented Lagrangian
function of the problem in Eq. (11) can be written as:

$$L_A(x_0, \mathcal{G}, U_j, e, x_2, f) = \lambda \|f\|_1 + \frac{1}{2}\|e\|_2^2
- \langle \Lambda^{f}, f - D x_2 \rangle + \frac{\beta^{f}}{2}\|f - D x_2\|_2^2$$
$$- \langle \Lambda^{x_0}, x_0 - x_2 - e - \mathrm{Vec}(\mathcal{G} \times_1 U_1 \times_2 U_2 \times_3 U_3) \rangle
+ \frac{\beta^{x_0}}{2}\|x_0 - x_2 - e - \mathrm{Vec}(\mathcal{G} \times_1 U_1 \times_2 U_2 \times_3 U_3)\|_2^2$$
$$- \langle \Lambda^{y}, y - \mathcal{A}(x_0) \rangle + \frac{\beta^{y}}{2}\|y - \mathcal{A}(x_0)\|_2^2,$$

where $\Lambda^{f}$, $\Lambda^{x_0}$ and $\Lambda^{y}$ are the Lagrange multiplier vectors,
and $\beta^{f}$, $\beta^{x_0}$ and $\beta^{y}$ are positive penalty scalars. It is difficult
to simultaneously optimize all these variables. We therefore
approximately solve this optimization problem by alternately
minimizing over one variable with the others fixed. This procedure
is the so-called multi-block alternating direction method of
multipliers (ADMM). Under the framework of multi-block
ADMM, the optimization problem of $L_A$ with respect to each
variable can be solved by the following sub-problems:


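1) $x_0$ sub-problem: Optimizing $L_A$ with respect to $x_0$ can
be treated as solving the following linear system:

$$(\beta^{x_0} \mathcal{I} + \beta^{y} \mathcal{A}^{*}\mathcal{A}) x_0 =
\Lambda^{x_0} + \beta^{x_0}(x_2 + e + \mathrm{Vec}(\mathcal{L})) + \mathcal{A}^{*}(\beta^{y} y - \Lambda^{y}),$$

where $\mathcal{A}^{*}$ indicates the adjoint of $\mathcal{A}$ and
$\mathcal{L} = \mathcal{G} \times_1 U_1 \times_2 U_2 \times_3 U_3$.
Obviously, this linear system can be solved by
off-the-shelf conjugate gradient techniques. When $\mathcal{A}\mathcal{A}^{*} = \mathcal{I}$,
this linear system has the following closed-form solution:

$$x_0 = \frac{1}{\beta^{x_0}} \Big(\mathcal{I} - \frac{\beta^{y}}{\beta^{x_0} + \beta^{y}} \mathcal{A}^{*}\mathcal{A}\Big) c_a, \qquad (12)$$

where $c_a = \Lambda^{x_0} + \beta^{x_0}(x_2 + e + \mathrm{Vec}(\mathcal{L})) + \mathcal{A}^{*}(\beta^{y} y - \Lambda^{y})$.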
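The closed form in Eq. (12) follows from $\mathcal{A}^{*}\mathcal{A}\mathcal{A}^{*}\mathcal{A} = \mathcal{A}^{*}\mathcal{A}$ when $\mathcal{A}\mathcal{A}^{*} = \mathcal{I}$. A small numerical check is sketched below; the dense random matrix `A` and the vector `rhs` are illustrative stand-ins for the compressive operator and the right-hand side of the linear system, not data from the paper.

```python
import numpy as np

rng = np.random.default_rng(1)
n, m = 12, 5
# Row-orthonormal sensing matrix, so that A @ A.T = I (the assumption of Eq. (12)).
Q, _ = np.linalg.qr(rng.standard_normal((n, m)))
A = Q.T                                   # shape (m, n)

beta_x0, beta_y = 0.7, 1.3                # arbitrary positive penalties
rhs = rng.standard_normal(n)              # stands in for the right-hand side

# Direct solve of (beta_x0 * I + beta_y * A^T A) x0 = rhs.
x_direct = np.linalg.solve(beta_x0 * np.eye(n) + beta_y * A.T @ A, rhs)

# Closed form: only one application of A and one of its adjoint are needed.
x_closed = (rhs - beta_y / (beta_x0 + beta_y) * (A.T @ (A @ rhs))) / beta_x0
```

The closed form avoids forming or inverting any matrix, which matters when the operator is only available as a fast transform.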


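2) $\mathcal{G}$ and $U_j$ sub-problems: The optimization sub-problem
of $L_A$ with respect to $\mathcal{G}$ and $U_j$ $(j = 1, 2, 3)$ can be rewritten
as:

$$\min_{\mathcal{G}, U_j} \frac{1}{2}\|\tilde{\mathcal{X}}_1 - \mathcal{G} \times_1 U_1 \times_2 U_2 \times_3 U_3\|_F^2 \quad \text{s.t. } U_j^T U_j = I, \qquad (13)$$

where $\tilde{\mathcal{X}}_1 = \mathcal{X}_0 - \mathcal{X}_2 - \mathcal{E} - \mathrm{Ten}(\Lambda^{x_0} / \beta^{x_0})$.
This sub-problem can be solved by the classic HOOI algorithm [45], [46].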
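For readers unfamiliar with HOOI, the following is a minimal NumPy sketch of the standard higher-order orthogonal iteration applied to a problem of the form of Eq. (13). It is a generic textbook version with HOSVD initialization and a fixed iteration count, both of which are assumptions; the helper names are hypothetical and not taken from the paper.

```python
import numpy as np

def unfold(T, mode):
    """Mode-n unfolding: rows are indexed by the chosen mode."""
    return np.moveaxis(T, mode, 0).reshape(T.shape[mode], -1)

def mode_product(T, M, mode):
    """Mode-n product T x_n M, where M has shape (new_dim, T.shape[mode])."""
    out = np.tensordot(M, T, axes=(1, mode))   # the new axis comes first
    return np.moveaxis(out, 0, mode)

def hooi(X, ranks, n_iter=20):
    # HOSVD initialization: leading left singular vectors of each unfolding.
    U = [np.linalg.svd(unfold(X, k), full_matrices=False)[0][:, :r]
         for k, r in enumerate(ranks)]
    for _ in range(n_iter):
        for k in range(X.ndim):
            Y = X
            for j in range(X.ndim):
                if j != k:                      # project on all other factors
                    Y = mode_product(Y, U[j].T, j)
            U[k] = np.linalg.svd(unfold(Y, k), full_matrices=False)[0][:, :ranks[k]]
    G = X
    for j in range(X.ndim):                     # core tensor
        G = mode_product(G, U[j].T, j)
    return G, U

def tucker_to_tensor(G, U):
    T = G
    for j, Uj in enumerate(U):
        T = mode_product(T, Uj, j)
    return T
```

Calling `hooi(X, (r1, r2, r3))` on a tensor with exact Tucker rank `(r1, r2, r3)` reproduces it up to rounding; otherwise it returns a local solution of the rank-constrained fit.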


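3) $e$ sub-problem: The sub-problem of $L_A$ with respect to
$e$ can be solved by

$$e = \frac{\beta^{x_0}(x_0 - x_2 - \mathrm{Vec}(\mathcal{L})) - \Lambda^{x_0}}{1 + \beta^{x_0}}, \qquad (14)$$

where $\mathcal{L} = \mathcal{G} \times_1 U_1 \times_2 U_2 \times_3 U_3$.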


4) $x_2$ sub-problem: The sub-problem of $L_A$ with respect
to $x_2$ can be solved by the following linear system:

$$(\beta^{x_0} \mathcal{I} + \beta^{f} D^{*}D) x_2 = \beta^{x_0}(x_0 - \mathrm{Vec}(\mathcal{L}) - e) - \Lambda^{x_0} + D^{*}(\beta^{f} f - \Lambda^{f}),$$

where $D^{*}$ indicates the adjoint of $D$. Let
$\mathcal{C}_b = \mathrm{Ten}(\beta^{x_0}(x_0 - \mathrm{Vec}(\mathcal{L}) - e) - \Lambda^{x_0} + D^{*}(\beta^{f} f - \Lambda^{f}))$.
Thanks to the block-circulant structure of the matrix corresponding
to the operator $D^{*}D$, it can be diagonalized by the 3D FFT matrix.
Therefore, $x_2$ can be computed quickly by

$$x_2 = \mathrm{Vec}\Big(\mathrm{ifftn}\Big(\frac{\mathrm{fftn}(\mathcal{C}_b)}{\beta^{x_0}\mathbf{1} + \beta^{f}(|\mathrm{fftn}(D_h)|^2 + |\mathrm{fftn}(D_v)|^2 + |\mathrm{fftn}(D_t)|^2)}\Big)\Big), \qquad (15)$$

where fftn and ifftn respectively indicate the fast 3D Fourier
transform and its inverse transform, $|\cdot|^2$ is the element-wise square,
and the division is also performed element-wise. Note that
the denominator in the equation can be pre-calculated outside
the main loop, avoiding the extra computational cost.
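A minimal NumPy sketch of the FFT-based solve in Eq. (15) is given below. It assumes that $D$ stacks forward differences along the two spatial axes and the temporal axis with periodic boundary conditions, which is what makes $D^{*}D$ block-circulant; the function names are hypothetical.

```python
import numpy as np

def diff(X, axis):
    """Periodic forward difference along one axis."""
    return np.roll(X, -1, axis=axis) - X

def diff_adjoint(Y, axis):
    """Adjoint of the periodic forward difference."""
    return np.roll(Y, 1, axis=axis) - Y

def tv_denominator(shape, beta_x0, beta_f):
    """Denominator of Eq. (15); it can be computed once outside the main loop."""
    denom = np.full(shape, beta_x0, dtype=float)
    delta = np.zeros(shape)
    delta[0, 0, 0] = 1.0
    for axis in range(3):
        # Transfer function of the difference operator along this axis.
        denom += beta_f * np.abs(np.fft.fftn(diff(delta, axis))) ** 2
    return denom

def solve_x2(C_b, denom):
    """Solve (beta_x0 * I + beta_f * D^* D) X = C_b with 3D FFTs."""
    return np.real(np.fft.ifftn(np.fft.fftn(C_b) / denom))
```

Applying the operator $\beta^{x_0}\mathcal{I} + \beta^{f} D^{*}D$ to a random volume and passing the result to `solve_x2` recovers that volume, which is a convenient sanity check for the boundary-condition assumption.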

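5) $f$ sub-problem: The sub-problem of $L_A$ with respect to
$f$ can be rewritten as

$$\min_{f} \lambda \|f\|_1 + \frac{\beta^{f}}{2}\Big\|f - \Big(D x_2 + \frac{\Lambda^{f}}{\beta^{f}}\Big)\Big\|_2^2.$$

This sub-problem can be solved by the well-known soft
shrinkage operator as follows:

$$f = \mathrm{soft}\Big(D x_2 + \frac{\Lambda^{f}}{\beta^{f}}, \frac{\lambda}{\beta^{f}}\Big), \qquad (16)$$

where $\mathrm{soft}(a, \tau) := \mathrm{sgn}(a) \cdot \max(|a| - \tau, 0)$.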
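The soft shrinkage operator and the resulting $f$ update of Eq. (16) can be sketched in a few lines of NumPy. The argument names `Dx2`, `Lambda_f`, `lam` and `beta_f` are illustrative stand-ins for $D x_2$, $\Lambda^{f}$, $\lambda$ and $\beta^{f}$.

```python
import numpy as np

def soft(a, tau):
    """Element-wise soft shrinkage: sgn(a) * max(|a| - tau, 0)."""
    return np.sign(a) * np.maximum(np.abs(a) - tau, 0.0)

def update_f(Dx2, Lambda_f, lam, beta_f):
    """Closed-form minimizer of lam*||f||_1 + beta_f/2*||f - (Dx2 + Lambda_f/beta_f)||^2."""
    return soft(Dx2 + Lambda_f / beta_f, lam / beta_f)
```

Entries whose magnitude is below the threshold are set to zero, and all others are moved toward zero by the threshold, which is what promotes sparse gradients in the foreground.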

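6) Updating multipliers: According to the ADMM, the
multipliers associated with $L_A$ are updated by the following
formulas:

$$\Lambda^{f} \leftarrow \Lambda^{f} - \gamma \beta^{f}(f - D x_2),$$
$$\Lambda^{x_0} \leftarrow \Lambda^{x_0} - \gamma \beta^{x_0}(x_0 - \mathrm{Vec}(\mathcal{L}) - e - x_2), \qquad (17)$$
$$\Lambda^{y} \leftarrow \Lambda^{y} - \gamma \beta^{y}(y - \mathcal{A}(x_0)),$$

where $\gamma$ is a parameter associated with the convergence rate, with
a value of, e.g., 1.1, and the penalty parameters $\beta^{f}$, $\beta^{x_0}$ and
$\beta^{y}$ follow an adaptive updating scheme. Take $\beta^{y}$ as an example.
Let $\mathrm{nRes} = \|y - \mathcal{A}(x_0)\|$ and let $\mathrm{nRes}_{pre}$ be its value
at the last iteration. $\beta^{y}$ is initialized by a small value and
then updated by the scheme:

$$\beta^{y} \leftarrow c_1 \cdot \beta^{y} \quad \text{if } \mathrm{nRes} > c_2 \cdot \mathrm{nRes}_{pre}, \qquad (18)$$

where $c_1$ and $c_2$ can be taken as 1.15 and 0.95, respectively.

Let us denote the rank constraints of $U_1$, $U_2$ and $U_3$ by $r_1$,
$r_2$ and $r_3$. The proposed algorithm for H-TenRPCA can now
be summarized in Algorithm 1.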
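The multiplier step of Eq. (17) and the adaptive penalty rule of Eq. (18) can be sketched as two small helpers. The text only states when a penalty is increased; leaving it unchanged otherwise is an assumption of this sketch, and the function names are hypothetical.

```python
import numpy as np

def update_multiplier(Lambda, residual, beta, gamma=1.1):
    """One multiplier step: Lambda <- Lambda - gamma * beta * residual."""
    return Lambda - gamma * beta * residual

def update_penalty(beta, n_res, n_res_pre, c1=1.15, c2=0.95):
    """Increase beta when the residual norm has not shrunk enough.

    Assumption: beta is left unchanged when the condition does not hold.
    """
    return c1 * beta if n_res > c2 * n_res_pre else beta

# Example for the measurement constraint y = A(x0), with a toy residual.
residual = np.array([0.2, -0.1])              # stands in for y - A(x0)
Lambda_y = update_multiplier(np.zeros(2), residual, beta=2.0)
beta_y = update_penalty(2.0, n_res=0.99, n_res_pre=1.0)
```

With the default constants, the penalty grows by 15% whenever the residual norm fails to drop by at least 5% between consecutive iterations.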








Algorithm 1 Optimization algorithm for H-TenRPCA.

Input: The measurements $y$; the algorithm parameters: $r_3$
and $\lambda$.

Initialization: $r_1 = \mathrm{ceil}(H \times 0.65)$ and $r_2 = \mathrm{ceil}(W \times 0.65)$;
$\mathcal{L}$ is initialized by the $(r_1, r_2, r_3)$-Tucker approximation of
$\mathrm{Ten}(\mathcal{A}^{*}(y))$; $x_2 = \mathcal{A}^{*}(y) - \mathrm{Vec}(\mathcal{L})$; other variables are
initialized by 0.

Output: $x_0$, $x_2$, and $x_1 = \mathrm{Vec}(\mathcal{L})$.

1: while not converged do
2:   Update $x_0$ via Eq. (12);
3:   Update $\mathcal{G}$ and $U_j$, or $\mathcal{L}$, via Eq. (13);
4:   Update $e$ via Eq. (14);
5:   Update $x_2$ via Eq. (15);
6:   Update $f$ via Eq. (16);
7:   Update multipliers and the related parameters via
     Eqs. (17) and (18).
8: end while


B. Optimization Algorithm for PG-TenRPCA

We now slightly modify Algorithm 1 to solve the PG-
TenRPCA model in Eq. (10). The major modification is that
the sub-problem in Eq. (13) is replaced by the following
optimization problem:

$$\min_{\mathcal{G}_p, U_{jp}, U_3} \frac{1}{2}\Big\|\tilde{\mathcal{X}}_1 - \mathrm{Ten}\Big(\big(\sum_p R_p^T R_p\big)^{-1} \sum_p R_p^T \mathrm{Vec}(\mathcal{G}_p \times_1 U_{1p} \times_2 U_{2p} \times_3 U_3 \times_4 U_{4p})\Big)\Big\|_F^2$$
$$\text{s.t. } U_{jp}^T U_{jp} = I \ (j = 1, 2, 4), \quad U_3^T U_3 = I.$$

This optimization problem can be converted to the following
optimization problem:

$$\min_{\mathcal{G}_p, U_{jp}, U_3} \frac{1}{2}\sum_{p=1}^{K} \|\mathcal{R}_p(\tilde{\mathcal{X}}_1) - \mathcal{G}_p \times_1 U_{1p} \times_2 U_{2p} \times_3 U_3 \times_4 U_{4p}\|_F^2$$
$$\text{s.t. } U_{jp}^T U_{jp} = I \ (j = 1, 2, 4), \quad U_3^T U_3 = I, \qquad (19)$$

where $\mathcal{R}_p(\tilde{\mathcal{X}}_1) = \mathrm{Ten}(R_p \tilde{x}_1)$ and $\tilde{x}_1$ is the vectorization
of $\tilde{\mathcal{X}}_1$.

The optimization problem in Eq. (19) can be approximately
solved by alternately applying the following updates:

$$\mathcal{G}_p = \mathcal{R}_p(\tilde{\mathcal{X}}_1) \times_1 U_{1p}^T \times_2 U_{2p}^T \times_3 U_3^T \times_4 U_{4p}^T \qquad (20)$$
$$U_{1p} = \mathrm{SVD}((\mathcal{R}_p(\tilde{\mathcal{X}}_1) \times_2 U_{2p}^T \times_3 U_3^T \times_4 U_{4p}^T)_{(1)}, r_1) \qquad (21)$$
$$U_{2p} = \mathrm{SVD}((\mathcal{R}_p(\tilde{\mathcal{X}}_1) \times_1 U_{1p}^T \times_3 U_3^T \times_4 U_{4p}^T)_{(2)}, r_2) \qquad (22)$$
$$U_{4p} = \mathrm{SVD}((\mathcal{R}_p(\tilde{\mathcal{X}}_1) \times_1 U_{1p}^T \times_2 U_{2p}^T \times_3 U_3^T)_{(4)}, r_4) \qquad (23)$$
$$U_3 = \mathrm{eigs}\Big(\sum_{p=1}^{K} Z_p Z_p^T, r_3\Big), \qquad (24)$$

where $Z_p = (\mathcal{R}_p(\tilde{\mathcal{X}}_1) \times_1 U_{1p}^T \times_2 U_{2p}^T \times_4 U_{4p}^T)_{(3)}$,
$\mathrm{SVD}(A, r)$ indicates the top $r$ left singular vectors of matrix $A$, and
$\mathrm{eigs}(A, r)$ indicates the top $r$ eigenvectors of matrix $A$. The detailed
derivation is listed in the Appendix. This iterative procedure is termed
the Joint HOOI Algorithm and is presented in Algorithm 2.

Algorithm 2 Joint HOOI Algorithm for minimizing Eq. (19).

Input: The initialization of $U_{1p}$, $U_{2p}$, $U_3$ and $U_{4p}$; $\mathcal{R}_p(\tilde{\mathcal{X}}_1)$.
Output: $U_{1p}$, $U_{2p}$, $U_3$ and $U_{4p}$; $\mathcal{G}_p$.

while not converged do
  Update $\mathcal{G}_p$ via Eq. (20);
  Update $U_{1p}$, $U_{2p}$ and $U_{4p}$ via Eqs. (21), (22)
  and (23);
  Update $U_3$ via Eq. (24);
end while

The algorithm for the PG-TenRPCA model can now be
easily designed by replacing step 3 in Algorithm 1 with
solving the optimization problem in Eq. (19). Additionally,
the clustering, i.e., the updating of $\mathcal{R}_p$ $(p = 1, 2, \cdots, K)$, is
performed once every several iterations, e.g., every 8 iterations, and
the first clustering is performed over an initialized video background.

It is obvious that our proposed models are non-convex and
non-separable optimization problems. Therefore, there may
exist many local minimizers and a suitable initialization is
crucial for attaining the desired solution. Although the convergence
of the multi-block ADMM for this kind of optimization
problem is, to the best of our knowledge, not guaranteed,
the experimental results in Section VI show that, given
a suitable initialization, the proposed algorithms based on the
multi-block ADMM with the adaptive scheme can produce
satisfactory results. Specifically, for the optimization algorithm
of H-TenRPCA, we initialize $\mathcal{L}$ by the $(r_1, r_2, r_3)$-Tucker
decomposition of $\mathrm{Ten}(\mathcal{A}^{*}(y))$ and $x_2$ by $\mathcal{A}^{*}(y) - \mathrm{Vec}(\mathcal{L})$. For
the optimization algorithm of PG-TenRPCA, the result from the
H-TenRPCA algorithm provides a suitable initialization.
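The joint HOOI updates in Eqs. (20)-(24) can be sketched in NumPy as follows. This is an illustrative version, not the authors' code: the SVD-based initialization, the fixed iteration count, and all function names are assumptions. Because the factor updates do not depend on the core tensors, the sketch computes the cores once at the end rather than inside the loop.

```python
import numpy as np

def unfold(T, mode):
    return np.moveaxis(T, mode, 0).reshape(T.shape[mode], -1)

def mode_product(T, M, mode):
    """Mode-n product T x_n M, where M has shape (new_dim, T.shape[mode])."""
    return np.moveaxis(np.tensordot(M, T, axes=(1, mode)), 0, mode)

def project_except(T, U, skip):
    """Multiply T by U[j].T along every mode j except `skip`."""
    for j, Uj in enumerate(U):
        if j != skip:
            T = mode_product(T, Uj.T, j)
    return T

def top_left_singular(M, r):
    return np.linalg.svd(M, full_matrices=False)[0][:, :r]

def joint_hooi(groups, ranks, n_iter=10):
    """groups: list of 4-order arrays (w, w, D, N) sharing the temporal factor."""
    r1, r2, r3, r4 = ranks
    U3 = top_left_singular(np.hstack([unfold(T, 2) for T in groups]), r3)
    Us = [[top_left_singular(unfold(T, 0), r1),
           top_left_singular(unfold(T, 1), r2),
           U3,
           top_left_singular(unfold(T, 3), r4)] for T in groups]
    for _ in range(n_iter):
        for T, U in zip(groups, Us):
            for mode, r in ((0, r1), (1, r2), (3, r4)):       # Eqs. (21)-(23)
                U[mode] = top_left_singular(unfold(project_except(T, U, mode), mode), r)
        S = np.zeros((groups[0].shape[2],) * 2)
        for T, U in zip(groups, Us):                          # Eq. (24)
            Z = unfold(project_except(T, U, 2), 2)
            S += Z @ Z.T
        U3 = np.linalg.eigh(S)[1][:, ::-1][:, :r3]            # top-r3 eigenvectors
        for U in Us:
            U[2] = U3
    cores = [project_except(T, U, skip=-1) for T, U in zip(groups, Us)]  # Eq. (20)
    return cores, Us
```

The only difference from running plain HOOI on each group is the temporal factor, which is estimated from the sum of the Gram matrices of all groups so that a single $U_3$ is shared.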


C. Implementation Issues 

In Algorithm 1, there exist four parameters, i.e., $r_1$, $r_2$,
$r_3$ and $\lambda$, where $r_1$ and $r_2$ control the complexity of spatial
redundancy, $r_3$ controls the complexity of temporal redundancy,
and $\lambda$ provides a trade-off between disturbance and
foreground modeling. $r_1$ and $r_2$ for the factor matrices $U_1$ and
$U_2$ are empirically taken as $r_1 = \mathrm{ceil}(H \times 0.65)$ and
$r_2 = \mathrm{ceil}(W \times 0.65)$ in all conducted experiments, and we
find that this setting works fairly well. Such
selected $r_1$ and $r_2$ can make the AccEgyR index attain a
ratio over 0.9 for various natural images, and we have shown
some examples in Fig. 2(d). For $r_3$ and $\lambda$, it is required to
carefully tune them for the testing data sets. We empirically found
that our algorithm achieves satisfactory performance when
$r_3$ is taken as 1 for the real-world data sets and $\lambda$
is taken in the range [0.01, 0.1].

In the optimization algorithm for the PG-TenRPCA model in
Eq. (10), we need to set nine parameters, i.e., the size of the 3D
patch $w$, the size of the search window around one patch $S$, the
number of collected similar 3D patches $N$, the sliding distance
$d$, the rank parameters $r_1$, $r_2$, $r_3$, $r_4$, and the trade-off
parameter $\lambda$. Empirically, $w$, $d$, $S$ and $N$ are respectively taken
as 8, 7, 36, and 45 [49], [50]. The rank constraint parameters
$r_1$, $r_2$, $r_4$ are respectively set to 8, 8, and ceil(45 × 0.45).

^ceil(a) indicates the smallest integer not less than a.
Here $r_4 = \mathrm{ceil}(45 \times 0.45)$ implies that our algorithm makes a
low rank approximation due to the large redundancy hidden
in the similar patches. Similar to Algorithm 1, we only need
to carefully tune $r_3$ for temporal complexity and the trade-off
parameter $\lambda$. We empirically found that the algorithm works
well when $r_3$ is taken as 1 for the real-world data sets and $\lambda$
is taken in the range [0.05, 0.1]. It is worth noting that, in order
to reduce the computational cost, the clustering is performed by the
K-nearest neighbor method. Specifically, for each 3D patch of
the patch set $S$, we search for $N$ similar 3D patches as a cluster
within a large window around this 3D patch.
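The K-nearest-neighbor grouping can be sketched as follows. The function name `knn_group`, the interpretation of the window size as a bound on the distance between patch corners, and the toy sizes are assumptions made for illustration.

```python
import numpy as np

def knn_group(patches, corners, ref, n_similar, window):
    """Indices of the n_similar patches closest to patch `ref` in Euclidean
    distance, searched only among patches whose top-left corner lies within
    window // 2 pixels of the reference corner in each spatial direction."""
    corners = np.asarray(corners)
    near = np.flatnonzero(
        np.all(np.abs(corners - corners[ref]) <= window // 2, axis=1))
    flat = patches.reshape(len(patches), -1)
    dist = np.linalg.norm(flat[near] - flat[ref], axis=1)
    order = np.argsort(dist, kind="stable")[:n_similar]
    return near[order]

rng = np.random.default_rng(4)
corners = [(i, j) for i in range(0, 24, 4) for j in range(0, 20, 4)]
patches = rng.random((len(corners), 4, 4, 3))      # (P, w, w, D) toy patches
group = knn_group(patches, corners, ref=12, n_similar=5, window=16)
# Stack the selected patches into a 4-order tensor of size w x w x D x N.
group_tensor = np.moveaxis(patches[group], 0, -1)
```

Restricting the search to a local window keeps the cost per reference patch independent of the total number of patches in the video.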


VI. Experimental Results 


In this section, we conduct experiments on synthetic
and real video datasets to demonstrate the superiority of the two
proposed models, i.e., H-TenRPCA and PG-TenRPCA, over the
existing state-of-the-art approaches for the BSCM task. All
the experiments are performed using MATLAB (R2013a) on
workstations with a dual-core Intel processor of 2.90 GHz and
30 GB of RAM, running Windows 7. The parameter
tuning is performed by grid search, for our proposed methods
as well as the compared methods, such that the averaged PSNR
index over video frames defined below achieves the best
value.

We first introduce the evaluation measures. We use the
F-measure to assess the detection performance of video foreground,
and the peak signal-to-noise ratio (PSNR) and the
structural similarity index (SSIM) to measure the reconstruction
accuracy. The F-measure is defined as

$$\text{F-measure} = 2\,\frac{\text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}},$$

where recall and precision are defined as:

$$\text{recall} = \frac{\#\text{correctly classified foreground pixels}}{\#\text{foreground pixels in ground truth}},$$

$$\text{precision} = \frac{\#\text{correctly classified foreground pixels}}{\#\text{pixels classified as foreground}}.$$

PSNR and SSIM commonly measure the similarity of two
images in intensity and structure, respectively. PSNR is defined as

$$\mathrm{PSNR} := 10 \times \log_{10} \frac{255^2}{\frac{1}{HW}\sum_{i,j}(I_{ij} - \hat{I}_{ij})^2},$$

where $I_{ij}$ and $\hat{I}_{ij}$ are respectively the intensity values of the
original and reconstructed images at the pixel $(i, j)$, and $H \times W$
is the frame size. SSIM measures the
structural similarity of two images; see [59] for details. We
use the averaged PSNR and SSIM over video frames to evaluate
the reconstruction performance of the video volume. Higher values of
F-measure, PSNR and SSIM indicate better performance.
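The F-measure and PSNR definitions above translate directly into code. The sketch below is a minimal NumPy version; returning 0 when no foreground pixel is correctly classified and infinity for identical images are conventions chosen here, not stated in the text.

```python
import numpy as np

def f_measure(pred_mask, true_mask):
    """F-measure of a binary foreground mask against the ground truth."""
    pred = np.asarray(pred_mask, dtype=bool)
    true = np.asarray(true_mask, dtype=bool)
    tp = np.count_nonzero(pred & true)          # correctly classified foreground
    if tp == 0:
        return 0.0                              # convention for the degenerate case
    precision = tp / np.count_nonzero(pred)
    recall = tp / np.count_nonzero(true)
    return 2.0 * precision * recall / (precision + recall)

def psnr(original, reconstructed, peak=255.0):
    """PSNR in dB for images with intensities in [0, peak]."""
    diff = np.asarray(original, dtype=float) - np.asarray(reconstructed, dtype=float)
    mse = np.mean(diff ** 2)
    return float("inf") if mse == 0 else 10.0 * np.log10(peak ** 2 / mse)
```

For a video volume, both measures would be computed per frame and then averaged, matching the averaged indices reported in the tables.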


A. Data Sets 

1) Synthetic Data: The SABS (Stuttgart Artificial Background
Subtraction) dataset is an artificial dataset for pixel-wise
evaluation of background models. The dataset consists of
video sequences for nine different challenges of background
subtraction. The basic class of the nine different challenges is used
to evaluate our proposed approach. We collect 128 frames
(NoForegroundDay0001 to NoForegroundDay0128) from
the SABS-basic data, and then scale each frame into an
image of size 128 × 128 as a frame of the true background.
Similarly, we choose 128 frames (GT0807 to GT0934) as
the foreground from the SABS-GT data and then transform the
intensity of these gray images into the range from 200 to 255
for visual contrast to the background. Then, it is easy to obtain
the original video volume $\mathcal{X}_0$ by combining the background
$\mathcal{X}_1$ and the foreground $\mathcal{X}_2$. The example video shown in the
earlier illustrations is from this dataset.

^http://www.vis.uni-stuttgart.de/index.php?id=sabs

Fig. 6. Sampled images from real videos. (a) Hall; (b) Bootstrap;
(c) ShoppingMall; (d) Lobby; (e) Highway; (f) Office; (g) PET-2006;
(h) Pedestrians; (i) Traffic; (j) Rain; (k) WalkByShop1front; (l) Browse2;
(m) ShopAssistant1front; (n) Bungalows; (o) CopyMachine; (p) Cubicle;
(q) PeopleInShade; (r) Fountain; (s) Curtain; (t) WaterSurface.

2) Real Data: We collect a set of real-world videos from the
CAVIAR dataset [60], the I2R dataset [61], the UCSD dataset [62]
and the CD.net dataset [63]. These data sets include various
real-world scenes ranging from simple scenes with static
backgrounds to complex scenes with camera jitter or
intermittent object motion. From these data sets, we choose
three categories of videos for testing our approach: static
background (Fig. 6(a)-(m)), shadow (Fig. 6(n)-(q)), and dynamic
background (Fig. 6(r)-(t)). For each video, 128 gray-scale
video frames are chosen as the video volume for our experiments.


B. Empirical Analysis for Algorithm Convergence 

We provide an empirical analysis for the convergence of
the proposed optimization algorithms on the synthetic video
described above and a real video “ShoppingMall” shown in
Fig. 6(c). The relative change
$\mathrm{relChg}A := \|A^{k} - A^{k-1}\|_F / \max(1, \|A^{k-1}\|_F)$
and the relative error
$\mathrm{relErr}A := \|A^{k} - A_0\|_F / \max(1, \|A_0\|_F)$
are used as the assessment indices of algorithm convergence, where
$A^{k}$ is the result in the $k$-th iteration and $A_0$ is the ground-truth result.

In Fig. 7, we show the curves of the relative change and
the relative error of the video volume $\mathcal{X}_0$ and the video foreground
$\mathcal{X}_2$ for the algorithms H-TenRPCA and PG-TenRPCA, where
the sampling ratio is set as 0.04 (1/25) and 0.05 (1/20),
respectively. relChgX2 denotes the relative change of the video
foreground $\mathcal{X}_2$. In Fig. 7(a)-(b), we show the convergence
results of H-TenRPCA and PG-TenRPCA on the synthetic
video, and in Fig. 7(c)-(d), the convergence results on a
real video. Note that in Fig. 7(c)-(d), we do not provide the
convergence results for the relative error of video foreground,
because the ground-truth video foreground for real data is
unknown.

^http://groups.inf.ed.ac.uk/vision/CAVIAR/CAVIARDATA1/
^http://perception.i2r.a-star.edu.sg/bk_model/bk_index.html
^http://www.svcl.ucsd.edu/projects/background_subtraction/
^http://changedetection.net

Fig. 7. The empirical analysis of algorithm convergence. (a) H-TenRPCA
(Synthetic Data); (b) PG-TenRPCA (Synthetic Data); (c) H-TenRPCA (Real
Data); (d) PG-TenRPCA (Real Data). Horizontal axes: iteration number (Iter).
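The two convergence indices can be computed with a few lines of NumPy, as sketched below; the function names are hypothetical. For arrays of any order, `np.linalg.norm` without further arguments returns the Frobenius norm of the flattened array.

```python
import numpy as np

def rel_change(A_k, A_prev):
    """relChgA = ||A^k - A^{k-1}||_F / max(1, ||A^{k-1}||_F)."""
    return np.linalg.norm(A_k - A_prev) / max(1.0, np.linalg.norm(A_prev))

def rel_error(A_k, A_0):
    """relErrA = ||A^k - A_0||_F / max(1, ||A_0||_F)."""
    return np.linalg.norm(A_k - A_0) / max(1.0, np.linalg.norm(A_0))
```

A typical stopping rule, given here only as an assumption, would terminate the main loop once `rel_change` of the video volume falls below a small tolerance.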

Generally, the relative change converges to zero when the
number of iterations is high, and the corresponding relative
error w.r.t. the ground truth gradually decreases to a stable
value. From Fig. 7(a) and (c), we observe a significant jump
of the relative change for the video foreground, relChgX2, between
iterations 60 and 100, and the jump corresponds to a large
decrease of the relative error shown in the right subfigures of
Fig. 7(a) and (c). Thus this jump corresponds to a sudden
significant improvement on the foreground estimation during
the optimization procedures. From all subfigures in Fig. 7,
we observe that the curves of all assessment indices reduce
to a stable value when the algorithms reach a relatively high
iteration number, which suggests that the proposed algorithms
converge well empirically.

C. Comparison with Existing Popular Methods 

We compare our models H-TenRPCA and PG-TenRPCA 
with three existing popular methods: SpaRCS |T9| , SpLR 
and ReProCS p0| . The SpaRCS and SpLR methods are both 
batch-based approaches as ours that process a batch of video 
frames, i.e., a video volume, as a whole. Whereas, ReProCS is 
an online method that processes the video frames sequentially. 
It requires to use the training video frames to initialize a video 
background, and requires the compressive operator over each 
frame to be the same. For fair comparison, we thus create an 
additional subsection to compare with the ReProCS method. 

1) Comparison with Batch-Based Methods: In this subsection,
we compare our approach with SpaRCS and
SpLR on synthetic data and real data sets. Considering
the feasibility on current chips of CS cameras, the randomly
permuted Walsh-Hadamard transform applied in a frame-wise manner
is chosen as the compressive operator. That is, the compressive
operator acts on each frame $d = 1, \cdots, D$ separately for all compared
methods. The sampling ratios are set as two high levels, 1/5
and 1/10, and three low levels, 1/20, 1/25 and 1/30, for assessing
the reconstruction and separation performance of all compared
methods. To illustrate the merits of the tensor modeling technique,
we also compare with a degenerated version of our method
H-TenRPCA, where the video background is modeled by a low
rank matrix instead of a tensor, on the synthetic video data.
The degenerated version is dubbed H-MatRPCA.
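A frame-wise randomly permuted Walsh-Hadamard operator can be sketched as follows. The exact construction used in the experiments is not specified in the text; this sketch assumes a random pixel permutation, an orthonormal Walsh-Hadamard transform in natural ordering, and random selection of the kept coefficients. All function names are hypothetical.

```python
import numpy as np

def fwht(x):
    """Orthonormal fast Walsh-Hadamard transform (length must be a power of 2)."""
    x = np.array(x, dtype=float)
    n = x.size
    h = 1
    while h < n:
        x = x.reshape(-1, 2, h)                 # blocks of two halves of length h
        x = np.stack((x[:, 0, :] + x[:, 1, :], x[:, 0, :] - x[:, 1, :]), axis=1)
        x = x.reshape(n)
        h *= 2
    return x / np.sqrt(n)

def make_operator(n, m, rng):
    """Return (forward, adjoint) for y = S H P x on one vectorized frame."""
    perm = rng.permutation(n)                        # random pixel permutation P
    rows = rng.choice(n, size=m, replace=False)      # kept coefficients S
    def forward(x):
        return fwht(x[perm])[rows]
    def adjoint(y):
        z = np.zeros(n)
        z[rows] = y                                  # zero-fill: S^T y
        out = np.zeros(n)
        out[perm] = fwht(z)                          # H is symmetric, then P^T
        return out
    return forward, adjoint

rng = np.random.default_rng(5)
forward, adjoint = make_operator(n=64, m=16, rng=rng)   # 8 x 8 frame, ratio 1/4
y = forward(rng.random(64))
```

Because the transform is orthonormal and rows are only subsampled, this operator satisfies $\mathcal{A}\mathcal{A}^{*} = \mathcal{I}$, which is the condition under which the closed-form $x_0$ update applies.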

TABLE II

Comparison of different methods on the synthetic video. Note
that PSNR here indicates the averaged PSNR on all video
frames. The same for SSIM and F-measure.

SR    Indices    SpaRCS  SpLR    H-MatRPCA  H-TenRPCA  PG-TenRPCA
1/5   PSNR       26.42   45.07   45.38      42.33      40.15
1/5   SSIM       0.8678  0.9955  0.9963     0.9918     0.9894
1/5   F-measure  0.6194  0.9195  0.8617     0.8624     0.8598
1/10  PSNR       17.03   34.09   35.16      34.95      34.38
1/10  SSIM       0.4723  0.9627  0.9726     0.9695     0.9673
1/10  F-measure  0.0704  0.8909  0.8601     0.8616     0.8566
1/20  PSNR       14.14   25.40   30.53      30.64      30.40
1/20  SSIM       0.2405  0.8184  0.9245     0.9275     0.9299
1/20  F-measure  0.0333  0.7069  0.8327     0.8337     0.8342
1/25  PSNR       13.86   23.57   28.45      28.80      29.61
1/25  SSIM       0.2129  0.7541  0.8776     0.8901     0.9236
1/25  F-measure  0.0311  0.5854  0.8172     0.8182     0.8240
1/30  PSNR       13.52   22.46   26.92      27.36      28.88
1/30  SSIM       0.1782  0.7039  0.8268     0.8492     0.9133
1/30  F-measure  0.0301  0.4767  0.8080     0.8086     0.8148

We show the quantitative results of all compared methods
with different sampling ratios on the synthetic data in
Table II. The averaged PSNR and SSIM values indicate the
reconstruction performance of the original video, and the
averaged F-measure values indicate the separation (or detection)
performance of video foreground. We observe that, when the
sampling ratio is taken as a high value of 1/5, all methods
can reconstruct the original video and detect a satisfactory
silhouette of video foreground. When the sampling ratio goes
down, our proposed models perform consistently better than
all the compared methods. First, our tensor based models
work significantly better than the conventional methods, i.e.,
SpaRCS and SpLR, and also the matrix version of our
model, i.e., H-MatRPCA. Second, the PG-TenRPCA model
that is based on video patch groups works better than the
H-TenRPCA model that takes the video volume as a single
tensor.

Figure 8(a) and (b) show the visual results when the
sampling ratio (SR) is taken as 1/5 and 1/25, respectively.
The last column shows the ground-truth videos and silhouettes
of video foregrounds. It can be observed that, when the
sampling ratio is a low value of 1/25, the SpaRCS method
totally fails in reconstructing the original video and detecting
the moving car. Although the SpLR method can produce
a slightly better result, the car in the reconstructed video
is blurred and the detected car is incomplete and disturbed
by noisy points. Compared with the ground truth in the last
column, H-MatRPCA, H-TenRPCA and PG-TenRPCA can
produce satisfactory visual results, but the video reconstructed
by H-MatRPCA is not as clear and sharp as those of the other two models.
It is worth noting that, compared with other methods, the
video reconstructed by PG-TenRPCA is very clear and sharp
due to the power of the nonlocal self-similarity prior.

Fig. 8. Comparison of visual results of different methods (SpaRCS, SpLR,
H-MatRPCA, H-TenRPCA, PG-TenRPCA, Ground-Truth) on a frame of the
synthetic video. (a) shows the results of reconstruction and detection of one video
frame with a high sampling ratio 0.2 (1/5); while (b) shows the case with a low
sampling ratio 0.04 (1/25) for comparison.


We also present the quantitative results of all compared
methods on a real captured video “ShoppingMall” with labeled
foregrounds in several video frames in Table III. The visual
comparison results on this video are shown in Fig. 9. Compared
with the synthetic video, this video is more challenging
due to the multiple walking persons in the video. From
Table III and Fig. 9, we can observe that, when the sampling
ratio is taken as a high value of 1/5, all methods except
the SpaRCS method can reconstruct a high-quality video and
detect relatively satisfactory silhouettes of the walking persons.
The SpaRCS method failed in detecting the walking persons
in Fig. 9(a), which might be because of the insufficiency of
the simple sparse prior used for video foreground in SpaRCS.
When the sampling ratio goes down, the SpLR method also
failed in detecting moving objects. However, our proposed
H-TenRPCA and PG-TenRPCA models can still produce
satisfactory results. Moreover, compared with the H-TenRPCA
method, the PG-TenRPCA method works better both visually
and quantitatively, because it uses a well-designed patch-based
prior, i.e., the nonlocal self-similarity, to model the patch-level
correlations of video background. As shown in Fig. 9(b), the
reconstructed region indicated by the red box is clearer and
sharper than the region indicated by the light blue box (best
seen by zooming in on the PDF), which is further
illustrated in Fig. 10. We further provide more experimental

TABLE III

Comparison of different methods on the ShoppingMall video.
Note that PSNR here indicates the averaged PSNR on all
frames. The same for SSIM and F-measure.

SR    Indices    SpaRCS  SpLR    H-TenRPCA  PG-TenRPCA
1/5   PSNR       25.55   35.38   41.02      40.25
1/5   SSIM       0.8290  0.9468  0.9768     0.9736
1/5   F-measure  0.1086  0.6647  0.6714     0.6672
1/10  PSNR       24.41   28.85   37.03      36.38
1/10  SSIM       0.7462  0.8434  0.9574     0.9535
1/10  F-measure  0.0366  0.5318  0.6611     0.6568
1/20  PSNR       22.37   25.19   31.34      32.48
1/20  SSIM       0.5730  0.6876  0.8779     0.9197
1/20  F-measure  0.0109  0.2207  0.6186     0.6314
1/25  PSNR       21.49   24.56   29.85      31.47
1/25  SSIM       0.5155  0.6378  0.8259     0.9058
1/25  F-measure  0.0123  0.1646  0.6008     0.6146
1/30  PSNR       20.76   23.97   28.38      30.71
1/30  SSIM       0.4651  0.5847  0.7561     0.8948
1/30  F-measure  0.0099  0.1298  0.5842     0.5964


results on various real videos to demonstrate the effectiveness
of our proposed models, especially for the low sampling ratios.
In Table IV and Table V, we show the quantitative results on
multiple real videos with a sampling ratio of 1/25 and the
averaged quantitative results on these videos with different
sampling ratios, respectively. As observed from Table IV, our
tensor-based models perform significantly better, with much higher
PSNR, SSIM and F-measure values than all the compared
methods at a low sampling ratio of 1/25. From Table V, we
observe that our proposed models work better overall across
different sampling ratios. At a high sampling ratio of 1/5, the
compared methods SpaRCS and SpLR can also reconstruct
the original video and detect foregrounds with reasonably high
values of PSNR, SSIM and F-measure, but still significantly
TABLE IV

The comparison results on real data with sampling ratio 1/25.
Note that PSNR here indicates the averaged PSNR on all
frames. The same for SSIM and F-measure. We mark the missing
statistics of F-measure by the notation "-" because the
ground-truth silhouette of video foreground is not provided.

Videos  Indices    SpaRCS  SpLR    H-TenRPCA  PG-TenRPCA
a       PSNR       17.85   20.58   27.28      28.48
a       SSIM       0.4974  0.6310  0.8469     0.9077
a       F-measure  0.0049  0.2534  0.6041     0.6132
b       PSNR       18.51   21.88   27.11      28.91
b       SSIM       0.4226  0.5819  0.7929     0.8786
b       F-measure  0.0035  0.2577  0.6404     0.6707
c       PSNR       21.49   24.56   29.85      31.47
c       SSIM       0.5155  0.6378  0.8259     0.9058
c       F-measure  0.0043  0.1646  0.6008     0.6146
d       PSNR       25.83   34.13   40.92      36.90
d       SSIM       0.7010  0.9029  0.9792     0.9737
d       F-measure  0.0131  0.5505  0.6938     0.6584
e       PSNR       13.97   20.27   27.59      29.47
e       SSIM       0.1971  0.5109  0.7949     0.8980
e       F-measure  0.0749  0.1794  0.6218     0.6444
f       PSNR       10.38   21.39   28.40      34.41
f       SSIM       0.1077  0.5167  0.8247     0.9565
f       F-measure  0.0479  0.1156  0.6203     0.6785
g       PSNR       13.26   25.45   38.52      40.10
g       SSIM       0.1678  0.6823  0.9603     0.9794
g       F-measure  0.0320  0.5937  0.7663     0.7708
h       PSNR       16.25   26.82   35.99      36.57
h       SSIM       0.3474  0.7490  0.9472     0.9661
h       F-measure  0.0171  0.5245  0.6086     0.6141
i       PSNR       19.27   35.73   39.98      43.23
i       SSIM       0.4205  0.8627  0.9459     0.9670
i       F-measure  0.0238  0.1078  0.2651     0.3478
j       PSNR       19.46   33.06   36.11      38.15
j       SSIM       0.4138  0.8288  0.9076     0.9378
j       F-measure  -       -       -          -
k       PSNR       19.73   23.55   32.09      37.88
k       SSIM       0.5361  0.6545  0.9125     0.9771
k       F-measure  -       -       -          -
l       PSNR       24.03   16.36   31.15      37.05
l       SSIM       0.6235  0.4018  0.8417     0.9612
l       F-measure  -       -       -          -
m       PSNR       25.24   31.03   36.80      38.90
m       SSIM       0.7183  0.8715  0.9725     0.9873
m       F-measure  -       -       -          -
n       PSNR       16.70   21.29   28.32      31.22
n       SSIM       0.3441  0.4603  0.7510     0.8593
n       F-measure  0.0022  0.1654  0.3182     0.3249
o       PSNR       15.57   19.24   31.83      33.83
o       SSIM       0.3304  0.4321  0.8726     0.9393
o       F-measure  0.0038  0.1556  0.8210     0.8307
p       PSNR       15.18   20.02   27.65      34.79
p       SSIM       0.2733  0.4052  0.7610     0.9454
p       F-measure  0.0028  0.1159  0.5534     0.6128
q       PSNR       17.67   21.45   27.19      33.77
q       SSIM       0.3648  0.5137  0.7899     0.9466
q       F-measure  0.0035  0.1915  0.7056     0.7812
r       PSNR       21.78   26.68   30.78      30.64
r       SSIM       0.6817  0.8338  0.9136     0.9257
r       F-measure  0.0064  0.2954  0.7450     0.7602
s       PSNR       21.03   23.92   28.93      32.72
s       SSIM       0.4253  0.5449  0.7643     0.8975
s       F-measure  0.0025  0.1726  0.5257     0.5845
t       PSNR       17.57   22.20   29.79      30.67
t       SSIM       0.3183  0.4678  0.7450     0.7912
t       F-measure  0.0026  0.1795  0.8515     0.8534

Fig. 9. Visual results of different methods (SpaRCS, SpLR, H-TenRPCA,
PG-TenRPCA, Ground-Truth) on a frame of the ShoppingMall
video. (a) shows the results of reconstruction and detection of one video
frame with a high sampling ratio 0.2 (1/5); while (b) shows the case with a
low sampling ratio 0.04 (1/25) for comparison.

Fig. 10. Comparison of PG-TenRPCA with H-TenRPCA with respect to
reconstruction performance with the sampling ratio 0.04.


lower than ours. When the sampling ratio goes down, the
models SpaRCS and SpLR fail to reconstruct the videos well
and to separate the video foregrounds, but our proposed models
H-TenRPCA and PG-TenRPCA can still perform very well,
with high values of PSNR, SSIM and F-measure. Moreover,
the PG-TenRPCA model is superior to H-TenRPCA on
average. In Fig. 11, we further visually show the results of
reconstruction and separation by different methods on six real
videos. Figure 11(a)-(c) show the results of three videos with a
high sampling ratio of 1/5, and Figure 11(d)-(f) show the results
of three videos with a low sampling ratio of 1/25. We can
see that when the sampling ratio is high, all these methods
can produce a good result, except that the SpaRCS method
detects incomplete silhouettes of video foregrounds. When
the sampling ratio is low, the SpaRCS method totally fails
and the SpLR method can reconstruct the video backgrounds
but its detected video foregrounds are blurred and incomplete.
However, the proposed H-TenRPCA and PG-TenRPCA models
can reconstruct a sharp video and detect a relatively
complete video foreground on each example.

2) Comparison with the Online Method ReProCS: We also
compared our methods with the ReProCS method on four
videos (synthetic video, Pedestrians, Cubicle, and Office). For
each video, we can find a frame sequence of video background
to train an initialized background for the ReProCS method.
The compressive operator is chosen as the randomly permuted
Walsh-Hadamard transform in the frame-wise manner, but set
TABLE V

The averaged results of different methods on various real videos.

SR    Indices    SpaRCS  SpLR    H-TenRPCA  PG-TenRPCA
1/5   PSNR       27.50   38.26   42.42      41.13
      SSIM       0.7812  0.9488  0.9788     0.9756
      F-measure  0.1946  0.6269  0.6799     0.6803
1/10  PSNR       23.48   32.34   39.41      38.13
      SSIM       0.6206  0.8430  0.9629     0.9599
      F-measure  0.0424  0.5368  0.6756     0.6760
1/20  PSNR       19.53   26.84   35.01      35.51
      SSIM       0.4665  0.7049  0.9153     0.9396
      F-measure  0.0158  0.3105  0.6393     0.6536
1/25  PSNR       18.54   24.48   31.82      34.46
      SSIM       0.4203  0.6245  0.8575     0.9301
      F-measure  0.0155  0.2514  0.6213     0.6475
1/30  PSNR       17.87   24.36   31.15      33.86
      SSIM       0.3943  0.6026  0.8239     0.9245
      F-measure  0.0150  0.2369  0.6202     0.6389

Ground-Truth SpaRCS SpLR H-TenRPCA PG-TenRPCA 



Fig. 11. The reconstruction and separation (RS) results of different methods 
on six videos. The first three videos show the RS results for a high sampling 
ratio 0.2, while the last three videos for a low sampling ratio 0.04. 


TABLE VI

Comparison with the online method ReProCS. “syn” indicates
the synthetic video. Videos (f), (h), and (p) are shown in Fig. [?].

Video  Indices    ReProCS  H-TenRPCA  PG-TenRPCA

Sampling Ratio = 0.75
syn    PSNR       25.56    22.34      41.6157
       SSIM       0.9612   0.5858     0.9890
       F-measure  0.8767   0.8664     0.8652
(f)    PSNR       15.23    17.62      42.29
       SSIM       0.8160   0.4119     0.9891
       F-measure  0.8098   0.8218     0.8120
(h)    PSNR       16.93    23.84      42.71
       SSIM       0.8780   0.5312     0.9832
       F-measure  0.7136   0.8217     0.7324
(p)    PSNR       14.45    16.48      45.09
       SSIM       0.6838   0.3295     0.9905
       F-measure  0.6845   0.8524     0.8383

Sampling Ratio = 0.5
syn    PSNR       25.32    19.12      33.7437
       SSIM       0.9509   0.4322     0.9541
       F-measure  0.8730   0.8569     0.8540
(f)    PSNR       15.13    14.16      36.44
       SSIM       0.7948   0.2739     0.9668
       F-measure  0.8021   0.8184     0.8141
(h)    PSNR       16.95    20.48      36.02
       SSIM       0.8760   0.3931     0.9375
       F-measure  0.7129   0.8057     0.7437
(p)    PSNR       14.52    13.12      38.98
       SSIM       0.6146   0.2225     0.9688
       F-measure  0.6876   0.8395     0.8289

Sampling Ratio = 0.25
syn    PSNR       24.94    17.22      24.42
       SSIM       0.9360   0.3281     0.7054
       F-measure  0.8617   0.8548     0.8434
(f)    PSNR       14.99    12.31      21.37
       SSIM       0.7524   0.1874     0.6263
       F-measure  0.7537   0.7820     0.7890
(h)    PSNR       17.02    18.59      26.68
       SSIM       0.8756   0.3225     0.7092
       F-measure  0.7090   0.8110     0.8073
(p)    PSNR       13.69    11.16      20.67
       SSIM       0.4529   0.1548     0.5502
       F-measure  0.5284   0.8368     0.8151

to be the same for each frame because of the constraint of
the ReProCS method on the compressive operator, i.e., $A_d =
D \cdot H \cdot P$ ($d = 1, 2, \cdots, D$). The sampling ratios in this group
of experiments are set as 0.75, 0.5, and 0.25, respectively.
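As an illustration of a frame-wise operator of the form D · H · P that is shared by all frames, the sketch below builds it explicitly for small frames. The helper names (`hadamard_matrix`, `make_operator`), the use of a dense Hadamard matrix, and the toy frame size are assumptions for readability, not the implementation used in the experiments.

```python
import numpy as np

def hadamard_matrix(n):
    """Orthonormal Walsh-Hadamard matrix (Sylvester construction, n = 2^k)."""
    assert n > 0 and n & (n - 1) == 0, "n must be a power of two"
    H = np.ones((1, 1))
    while H.shape[0] < n:
        H = np.block([[H, H], [H, -H]])
    return H / np.sqrt(n)

def make_operator(n_pixels, sampling_ratio, seed=0):
    """Return (forward, adjoint) of A = D * H * P acting on one vectorized frame."""
    rng = np.random.default_rng(seed)
    H = hadamard_matrix(n_pixels)                       # H is symmetric, H @ H = I
    perm = rng.permutation(n_pixels)                    # P: random pixel permutation
    m = int(round(sampling_ratio * n_pixels))
    rows = rng.choice(n_pixels, size=m, replace=False)  # D: random row selection

    def forward(x):
        return (H @ x[perm])[rows]

    def adjoint(y):
        z = np.zeros(n_pixels)
        z[rows] = y              # D^T: zero-fill the discarded coefficients
        w = H @ z                # H^T = H
        x = np.empty(n_pixels)
        x[perm] = w              # P^T: undo the permutation
        return x

    return forward, adjoint

# The same operator is applied to every frame (4 toy frames of 8 x 8 pixels).
frames = np.random.default_rng(1).random((4, 64))
forward, adjoint = make_operator(64, 0.25)
measurements = np.stack([forward(f) for f in frames])
print(measurements.shape)
```

With this construction the rows of the operator are orthonormal, so applying `forward` after `adjoint` returns the measurement vector unchanged.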

We exhibit the reconstruction and separation results in
Table VI. From this table, we can find that our proposed
methods outperform the ReProCS method in most cases in terms of
the F-measure index for video separation (foreground detection).
This good detection performance on the video foreground can be
attributed to the spatio-temporal continuity promoted by the 3D
total variation. Moreover, for video reconstruction, the
PG-TenRPCA method is superior to the H-TenRPCA and
ReProCS methods in terms of the PSNR and SSIM indices. Additionally,
when the sampling ratio is very low, on some videos, e.g.,
the synthetic video and video (p), the ReProCS method can
reconstruct a better video than the H-TenRPCA method in
terms of the SSIM and PSNR indices. This is because the
pre-trained video background provides sufficient information for
the ReProCS method; in contrast, our proposed methods
do not require a pre-training procedure. These findings
are further supported by Fig. 12, where we exhibit the
reconstruction and separation results of one video with a high
sampling ratio of 0.75 and a low sampling ratio of 0.25.

3) Computational Speeds: We compare the running time of
different models on the resized video “ShoppingMall” of
size 64 × 64 × 128. Note that this comparison
of running times is only illustrative. The running
times (in seconds) for the BSCM task by SpaRCS, SpLR, ReProCS,
H-TenRPCA, and PG-TenRPCA are 14.4606, 2147.6,
995.9870, 93.3779, and 1015.5, respectively. It can be observed
that the SpaRCS method is the fastest among all
the compared methods, but it cannot achieve
video reconstruction and separation performance comparable
to that of the other methods. The ReProCS and SpLR methods
require solving a large linear system, which is
computationally expensive. For our proposed basic model
H-TenRPCA, since the compressive operator used in the experiments is
orthogonal in columns, each sub-problem has a closed-form
solution, which makes H-TenRPCA relatively fast. In
PG-TenRPCA, the search for similar 3D patches and the
joint Tucker decomposition are both
computationally expensive, which makes it relatively slow.
However, as shown in Fig. 5, for each cluster composed of
similar 3D patches, all related computations can be performed
in parallel. The computational cost would therefore be greatly
reduced if more processors were available. The parallelization
of our optimization algorithm for PG-TenRPCA deserves
investigation in future work.


D. Effect of Compressive Operators 

In this subsection, we report the reconstruction and
separation performance of our proposed models based on
measurements of videos captured with different compressive
operators. The compressive operator is chosen as the randomly
permuted Walsh-Hadamard transform in the holistic and
frame-wise manners (WHT-h and WHT-f), and the randomly
permuted noiselet transform in the holistic and frame-wise
manners (Noiselet-h and Noiselet-f), respectively.
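The difference between the holistic and the frame-wise manner can be sketched as follows for the Walsh-Hadamard case. This is a toy illustration under stated assumptions: `fwht` and `sample_wht` are hypothetical helper names, each frame-wise operator draws its own permutation and coefficient subset, and the noiselet transform is not covered.

```python
import numpy as np

def fwht(x):
    """Orthonormal fast Walsh-Hadamard transform of a length-2^k vector."""
    x = np.array(x, dtype=float)
    n = x.size
    assert n > 0 and n & (n - 1) == 0, "length must be a power of two"
    h = 1
    while h < n:
        x = x.reshape(-1, 2, h)                       # pair up blocks of length h
        x = np.stack([x[:, 0] + x[:, 1], x[:, 0] - x[:, 1]], axis=1)
        h *= 2
    return x.reshape(n) / np.sqrt(n)

def sample_wht(x, m, rng):
    """Randomly permute x, apply the WHT, and keep m random coefficients."""
    perm = rng.permutation(x.size)
    keep = rng.choice(x.size, size=m, replace=False)
    return fwht(x[perm])[keep]

rng = np.random.default_rng(0)
video = rng.random((8, 8, 16))        # height x width x frames (toy size)
ratio = 0.25

# Holistic: a single operator acts on the whole vectorized video.
y_holistic = sample_wht(video.ravel(), int(ratio * video.size), rng)

# Frame-wise: a separate operator acts on each vectorized frame.
n_pix = video.shape[0] * video.shape[1]
y_framewise = np.concatenate([
    sample_wht(video[:, :, d].ravel(), int(ratio * n_pix), rng)
    for d in range(video.shape[2])
])
print(y_holistic.size, y_framewise.size)
```

Both variants produce the same number of measurements at a given sampling ratio; the holistic variant mixes pixels across frames, whereas the frame-wise variant only mixes pixels within a frame.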


In Tables VII and VIII, we show the quantitative results
of the proposed models with different sampling ratios on
the synthetic and “ShoppingMall” videos, respectively. From
these two tables, we can observe that when the sampling
ratio goes down, the results based on these four compressive
operators deteriorate in terms of PSNR, SSIM and F-measure;
the results based on WHT-h (WHT-f) are comparable to those
based on Noiselet-h (Noiselet-f); and the results based on
the compressive operator in the holistic manner (WHT-h and
Noiselet-h) are slightly better than those based on the
frame-wise compressive operator. It is worth pointing out that our
proposed models can be combined with any compressive
operator in the same framework.


VII. Conclusion 

In this paper, we proposed a novel tensor-based robust
PCA approach for background subtraction from compressive
measurements, in which Tucker decomposition is utilized to
model the spatio-temporal correlation of the background in
video streams, and 3D-TV is employed to characterize the
smoothness of the video foreground. Furthermore, we proposed
an improved tensor RPCA model that models the video
background as several tensors over groups of similar video
patches, taking advantage of the strong correlations among the
patches in each patch group. Extensive experiments on synthetic
and real-world data sets were conducted to demonstrate


(b) 

SR=0.25 



Ground-Truth ReProCS H-TenRPCA PG-TenRPCA 


Fig. 12. Comparison of our methods H-TenRPCA and PG-TenRPCA with 
the online method ReProCS in reconstruction and separation performance. 


TABLE VII

Effect of different compressive operators on the synthetic video.

                  WHT-f                   WHT-h                   Noiselet-f              Noiselet-h
SR    Index       H-TenRPCA   PG-TenRPCA  H-TenRPCA   PG-TenRPCA  H-TenRPCA   PG-TenRPCA  H-TenRPCA   PG-TenRPCA
1/5   PSNR        42.33       40.15       42.48       43.12       42.36       43.39       42.46       43.78
      SSIM        0.9918      0.9894      0.9922      0.9945      0.9919      0.9947      0.9921      0.9953
      F-measure   0.8624      0.8598      0.8616      0.8588      0.8605      0.8582      0.8614      0.8585
1/10  PSNR        34.95       34.38       34.96       35.86       35.18       35.54       35.33       35.67
      SSIM        0.9695      0.9673      0.9704      0.9767      0.9705      0.9746      0.9722      0.9756
      F-measure   0.8616      0.8566      0.8604      0.8638      0.8608      0.8603      0.8615      0.8614
1/20  PSNR        30.64       30.40       30.71       31.32       30.88       31.35       30.81       31.37
      SSIM        0.9275      0.9299      0.9307      0.9459      0.9274      0.9457      0.9311      0.9461
      F-measure   0.8337      0.8342      0.8277      0.8318      0.8345      0.8374      0.8290      0.8357
1/25  PSNR        28.80       29.61       28.87       30.31       28.96       30.34       28.89       30.21
      SSIM        0.8901      0.9236      0.8966      0.9349      0.8818      0.9350      0.8962      0.9337
      F-measure   0.8182      0.8240      0.8132      0.8221      0.8202      0.8272      0.8137      0.8219
1/30  PSNR        27.36       28.88       27.72       29.46       27.05       29.52       28.03       29.47
      SSIM        0.8492      0.9133      0.8635      0.9229      0.8189      0.9237      0.8710      0.9227
      F-measure   0.8086      0.8148      0.8006      0.8114      0.8100      0.8168      0.8024      0.8171


TABLE VIII

Effect of different compressive operators on the
“ShoppingMall” video.

                  WHT-f                   WHT-h                   Noiselet-f              Noiselet-h
SR    Index       H-TenRPCA   PG-TenRPCA  H-TenRPCA   PG-TenRPCA  H-TenRPCA   PG-TenRPCA  H-TenRPCA   PG-TenRPCA
1/5   PSNR        41.01       40.25       41.07       40.40       40.94       40.43       41.07       40.53
      SSIM        0.9768      0.9736      0.9772      0.9745      0.9767      0.9745      0.9771      0.9761
      F-measure   0.6714      0.6672      0.6697      0.6671      0.6715      0.6701      0.6707      0.6680
1/10  PSNR        36.94       36.38       37.06       36.45       36.93       36.47       37.07       36.48
      SSIM        0.9569      0.9535      0.9583      0.9549      0.9576      0.9549      0.9588      0.9554
      F-measure   0.6635      0.6568      0.6626      0.6591      0.6591      0.6565      0.6620      0.6609
1/20  PSNR        31.07       32.48       31.34       32.52       31.29       32.57       31.64       32.54
      SSIM        0.8715      0.9197      0.8779      0.9197      0.8726      0.9202      0.8861      0.9206
      F-measure   0.6186      0.6314      0.6179      0.6345      0.6155      0.6309      0.6221      0.6310
1/25  PSNR        29.85       31.47       29.96       31.49       28.87       30.11       30.57       31.41
      SSIM        0.8259      0.9058      0.8384      0.9081      0.7740      0.9047      0.8414      0.9060
      F-measure   0.6008      0.6146      0.5941      0.6172      0.5922      0.6082      0.5964      0.6126
1/30  PSNR        28.38       30.71       28.72       30.57       26.73       29.49       28.71       30.57
      SSIM        0.7561      0.8948      0.7870      0.8940      0.6315      0.8911      0.7868      0.8916
      F-measure   0.5842      0.5964      0.5874      0.6073      0.5677      0.5852      0.5764      0.5937
the superiority of the proposed approaches over the existing
state-of-the-art approaches.

In future work, we are interested in the following
research directions. First, to model the layers of the foregrounds
using a mixture of Gaussians to enhance the encoding capability
for foregrounds with complex configurations. Second, to develop better
models for complex backgrounds, such as dynamic backgrounds
with illumination change, smog or snow.
Third, to incorporate camera motion into our proposed
models. Finally, to develop an online version of our approach to
make it more effective, thus facilitating its use in
more practical scenarios.

Appendix A 

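This optimization problem can be approximately solved
by the alternating direction method (ADM). Firstly, fixing
the orthogonal factors $U_{1p}$, $U_{2p}$, $U_3$, and $U_{4p}$, we have

$$\|\mathcal{R}_p(\mathcal{X}_1) - \mathcal{G}_p \times_1 U_{1p} \times_2 U_{2p} \times_3 U_3 \times_4 U_{4p}\|_F^2 = \|\mathcal{R}_p(\mathcal{X}_1) \times_1 U_{1p}^T \times_2 U_{2p}^T \times_3 U_3^T \times_4 U_{4p}^T - \mathcal{G}_p\|_F^2.$$

Hence, it follows that

$$\mathcal{G}_p = \mathcal{R}_p(\mathcal{X}_1) \times_1 U_{1p}^T \times_2 U_{2p}^T \times_3 U_3^T \times_4 U_{4p}^T.$$

Then, using the solution of $\mathcal{G}_p$, we further derive that

$$\|\mathcal{R}_p(\mathcal{X}_1) - \mathcal{G}_p \times_1 U_{1p} \times_2 U_{2p} \times_3 U_3 \times_4 U_{4p}\|_F^2 = \|\mathcal{R}_p(\mathcal{X}_1)\|_F^2 - 2\langle \mathcal{R}_p(\mathcal{X}_1), \mathcal{G}_p \times_1 U_{1p} \times_2 U_{2p} \times_3 U_3 \times_4 U_{4p} \rangle + \|\mathcal{G}_p\|_F^2$$
$$= \|\mathcal{R}_p(\mathcal{X}_1)\|_F^2 - \|\mathcal{R}_p(\mathcal{X}_1) \times_1 U_{1p}^T \times_2 U_{2p}^T \times_3 U_3^T \times_4 U_{4p}^T\|_F^2.$$

The factor matrix $U_{1p}$ can be estimated by maximizing
$\|\mathcal{R}_p(\mathcal{X}_1) \times_1 U_{1p}^T \times_2 U_{2p}^T \times_3 U_3^T \times_4 U_{4p}^T\|_F^2$ with respect
to $U_{1p}$. It then easily follows that $U_{1p} = \mathrm{SVD}\big((\mathcal{R}_p(\mathcal{X}_1) \times_2
U_{2p}^T \times_3 U_3^T \times_4 U_{4p}^T)_{(1)}, r_1\big)$. Here, $\mathrm{SVD}(A, r)$ indicates the top
$r$ singular vectors of matrix $A$. Likewise, we can obtain the
solutions for the factor matrices $U_{2p}$ and $U_{4p}$. Finally, the factor
matrix $U_3$ can be estimated by maximizing $\sum_p \|\mathcal{R}_p(\mathcal{X}_1) \times_1
U_{1p}^T \times_2 U_{2p}^T \times_3 U_3^T \times_4 U_{4p}^T\|_F^2$ with respect to $U_3$. It is
easy to find that $U_3 = \mathrm{eigs}\big(\sum_p Z_p Z_p^T, r_3\big)$, where $Z_p =
(\mathcal{R}_p(\mathcal{X}_1) \times_1 U_{1p}^T \times_2 U_{2p}^T \times_4 U_{4p}^T)_{(3)}$ and $\mathrm{eigs}(A, r)$ indicates the
top $r$ eigenvectors of matrix $A$.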
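The alternating updates described in this appendix can be sketched in NumPy as follows. This is a minimal illustration, not the authors' implementation: the function names, the initialization by leading singular vectors, the fixed number of sweeps, and the assumption that all patch groups have the same size are choices made here for brevity.

```python
import numpy as np

def unfold(T, mode):
    """Mode-n unfolding: rows are indexed by the chosen mode."""
    return np.moveaxis(T, mode, 0).reshape(T.shape[mode], -1)

def mode_product(T, M, mode):
    """Mode-n product T x_n M for a matrix M of shape (J, T.shape[mode])."""
    return np.moveaxis(np.tensordot(M, T, axes=(1, mode)), 0, mode)

def project(T, factors, skip=None):
    """Apply U^T along every mode except `skip`."""
    for mode, U in enumerate(factors):
        if mode != skip:
            T = mode_product(T, U.T, mode)
    return T

def top_singular_vectors(A, r):
    """SVD(A, r): the r leading left singular vectors of A."""
    return np.linalg.svd(A, full_matrices=False)[0][:, :r]

def top_eigenvectors(S, r):
    """eigs(S, r): the r leading eigenvectors of a symmetric matrix S."""
    _, V = np.linalg.eigh(S)            # eigenvalues in ascending order
    return V[:, ::-1][:, :r]

def joint_tucker(groups, ranks, n_iter=5):
    """Alternate over per-group factors U1p, U2p, U4p and the shared U3."""
    # Initialization (an assumption): leading singular vectors of each unfolding.
    U = [[top_singular_vectors(unfold(T, m), ranks[m]) for m in range(4)]
         for T in groups]
    U3 = top_eigenvectors(sum(unfold(T, 2) @ unfold(T, 2).T for T in groups),
                          ranks[2])
    for _ in range(n_iter):
        for T, Up in zip(groups, U):
            Up[2] = U3
            for m in (0, 1, 3):         # per-group factor updates
                Y = project(T, Up, skip=m)
                Up[m] = top_singular_vectors(unfold(Y, m), ranks[m])
        S = 0
        for T, Up in zip(groups, U):    # shared factor: sum_p Z_p Z_p^T
            Z = unfold(project(T, Up, skip=2), 2)
            S = S + Z @ Z.T
        U3 = top_eigenvectors(S, ranks[2])
    for Up in U:
        Up[2] = U3
    cores = [project(T, Up) for T, Up in zip(groups, U)]   # G_p
    return cores, U

# Toy usage: three random 4-way patch groups of identical size.
rng = np.random.default_rng(0)
groups = [rng.standard_normal((4, 4, 5, 6)) for _ in range(3)]
cores, factors = joint_tucker(groups, ranks=(2, 2, 2, 3))
print(cores[0].shape)
```

Each per-group update only involves its own patch group, which is why these steps can be distributed across processors, while the update of the shared factor requires accumulating the Gram matrices of all groups.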